Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval

04/05/2021
by Ramon Sanabria, et al.

Speech-based image retrieval has been studied as a proxy for joint representation learning, usually without emphasis on retrieval itself. As such, it is unclear how well speech-based retrieval can work in practice – both in an absolute sense and versus alternative strategies that combine automatic speech recognition (ASR) with strong text encoders. In this work, we extensively study and expand choices of encoder architectures, training methodology (including unimodal and multimodal pretraining), and other factors. Our experiments cover different types of speech in three datasets: Flickr Audio, Places Audio, and Localized Narratives. Our best model configuration achieves large gains over the state of the art, e.g., pushing recall-at-one from 21.8 [...]. We show our best speech-based models can match or exceed cascaded ASR-to-text encoding when speech is spontaneous, accented, or otherwise hard to automatically transcribe.
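
The recall-at-one figure mentioned in the abstract is a standard retrieval metric: the fraction of spoken queries for which the correct image is ranked first. As a rough illustration only, and not the authors' evaluation code, the sketch below computes recall-at-k from a query-by-image score matrix. The function name `recall_at_k`, the assumption that query i matches image i, and the toy scores are all hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose matching image ranks in the top k.

    similarity[i][j] is the score between spoken query i and image j.
    Assumption for this sketch: the correct image for query i is image i.
    """
    hits = 0
    for i, scores in enumerate(similarity):
        # Rank image indices by descending score for this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for three spoken queries against three images (made-up numbers).
toy = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(toy, k=1)
r2 = recall_at_k(toy, k=2)
```

In a real system the scores would typically come from comparing speech and image embeddings; here they are hard-coded so that the metric itself is the only moving part.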


research
11/04/2020

Augmenting Images for ASR and TTS through Single-loop and Dual-loop Multimodal Chain Framework

Previous research has proposed a machine speech chain to enable automati...
research
04/27/2022

Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations

Multimodal speech recognition aims to improve the performance of automat...
research
04/27/2023

Understanding Shared Speech-Text Representations

Recently, a number of approaches to train speech models by incorporatin...
research
09/28/2022

TVLT: Textless Vision-Language Transformer

In this work, we present the Textless Vision-Language Transformer (TVLT)...
research
10/23/2019

Analyzing ASR pretraining for low-resource speech-to-text translation

Previous work has shown that for low-resource source languages, automati...
research
09/30/2021

SpliceOut: A Simple and Efficient Audio Augmentation Method

Time masking has become a de facto augmentation technique for speech and...
research
03/27/2023

Model Cascades for Efficient Image Search

Modern neural encoders offer unprecedented text-image retrieval (TIR) ac...
