Fast-Slow Transformer for Visually Grounding Speech

09/16/2021
by Puyuan Peng, et al.

We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS. FaST-VGS is a Transformer-based model for learning the associations between raw speech waveforms and visual images. The model unifies dual-encoder and cross-attention architectures into a single model, reaping the superior retrieval speed of the former along with the accuracy of the latter. FaST-VGS achieves state-of-the-art speech-image retrieval accuracy on benchmark datasets, and its learned representations exhibit strong performance on the ZeroSpeech 2021 phonetic and semantic tasks.
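The coarse-to-fine idea behind a fast-slow design, scoring the whole gallery cheaply with dual-encoder dot products and then re-ranking a small shortlist with the expensive cross-attention branch, can be illustrated with a minimal PyTorch sketch. This is an illustrative approximation, not the paper's actual implementation: the names `fast_slow_retrieve` and `cross_scorer`, and the dummy bilinear scorer in the usage example, are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def fast_slow_retrieve(query_emb, gallery_embs, cross_scorer, top_k=20):
    """Coarse-to-fine speech-to-image retrieval (hypothetical sketch).

    query_emb:    (d,) pooled embedding of one spoken caption (fast branch).
    gallery_embs: (N, d) precomputed image embeddings (fast branch).
    cross_scorer: callable(query_emb, image_emb) -> scalar score; stands in
                  for the expensive cross-attention branch (assumed API).
    """
    # Fast pass: dual-encoder similarity is one matrix-vector product,
    # so the entire gallery can be scored cheaply.
    sims = F.normalize(gallery_embs, dim=-1) @ F.normalize(query_emb, dim=0)
    shortlist = sims.topk(min(top_k, sims.numel())).indices

    # Slow pass: re-rank only the shortlist with the cross-attention scorer,
    # which is accurate but too costly to run over the full gallery.
    rerank = torch.stack([cross_scorer(query_emb, gallery_embs[i])
                          for i in shortlist])
    return shortlist[rerank.argsort(descending=True)]

# Toy usage with random embeddings and a dummy bilinear "cross" scorer.
if __name__ == "__main__":
    d, n = 64, 1000
    query, gallery = torch.randn(d), torch.randn(n, d)
    W = torch.randn(d, d)
    scorer = lambda a, b: a @ W @ b
    print(fast_slow_retrieve(query, gallery, scorer, top_k=10)[:5])
```

Under this scheme, only the shortlist ever passes through the slow branch, so per-query retrieval cost scales with top_k rather than with the gallery size, which is how such a model can combine dual-encoder speed with cross-attention accuracy.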

Related research:

02/07/2022 · Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling
In this paper, we describe our submissions to the ZeroSpeech 2021 Challe...

10/15/2021 · Cascaded Fast and Slow Models for Efficient Semantic Code Search
The goal of natural language semantic code search is to retrieve a seman...

03/30/2021 · Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers
Our objective is language-based search of large-scale image and video da...

07/27/2022 · SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding
In this paper, we investigate how to achieve better visual grounding wit...

06/26/2019 · Sharing Attention Weights for Fast Transformer
Recently, the Transformer machine translation system has shown strong re...

09/28/2022 · Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding
Multimodal transformer exhibits high capacity and flexibility to align i...

04/27/2022 · Ultra Fast Speech Separation Model with Teacher Student Learning
Transformer has been successfully applied to speech separation recently ...
