
Visual Keyword Spotting with Attention

by K R Prajwal, et al.

In this paper, we consider the task of spotting spoken keywords in silent video sequences – also known as visual keyword spotting. To this end, we investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword, and output the temporal location of the keyword if present. Our contributions are as follows: (1) We propose a novel architecture, the Transpotter, that uses full cross-modal attention between the visual and phonetic streams; (2) We show through extensive evaluations that our model outperforms prior state-of-the-art visual keyword spotting and lip reading methods on the challenging LRW, LRS2, and LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
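The core idea of the two-stream design can be sketched with a minimal NumPy example. This is not the actual Transpotter implementation: the feature dimensions, the random stand-in encodings, and the per-frame sigmoid localisation head are all illustrative assumptions. It only shows the mechanics of full cross-modal attention, where each video frame attends over the phonetic encoding of the keyword and a head then scores keyword presence per frame.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query attends over all keys.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)        # (T_q, T_k)
    return softmax(scores, axis=-1) @ values      # (T_q, d)

rng = np.random.default_rng(0)
T_vid, T_phon, d = 12, 5, 16                      # video frames, keyword phonemes, feature dim
visual = rng.normal(size=(T_vid, d))              # stand-in for the visual encoder output
phonetic = rng.normal(size=(T_phon, d))           # stand-in for the phonetic keyword encoding

# Full cross-modal attention: every video frame queries the phonetic stream.
fused = cross_attention(visual, phonetic, phonetic)   # (T_vid, d)

# A hypothetical per-frame head scores keyword presence at each time step;
# peaks in these scores would indicate the temporal location of the keyword.
w = rng.normal(size=(d,))
frame_scores = 1.0 / (1.0 + np.exp(-(fused @ w)))     # sigmoid, shape (T_vid,)
```

In the paper's setting, such per-frame probabilities would be thresholded (or the maximum taken) to decide whether the keyword occurs and, if so, at which frames.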

