Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling

02/07/2022
by Puyuan Peng, et al.

In this paper, we describe our submissions to the ZeroSpeech 2021 Challenge and SUPERB benchmark. Our submissions are based on the recently proposed FaST-VGS model, which is a Transformer-based model that learns to associate raw speech waveforms with semantically related images, all without the use of any transcriptions of the speech. Additionally, we introduce a novel extension of this model, FaST-VGS+, which is learned in a multi-task fashion with a masked language modeling objective in addition to the visual grounding objective. On ZeroSpeech 2021, we show that our models perform competitively on the ABX task, outperform all other concurrent submissions on the Syntactic and Semantic tasks, and nearly match the best system on the Lexical task. On the SUPERB benchmark, we show that our models also achieve strong performance, in some cases even outperforming the popular wav2vec2.0 model.
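The multi-task setup described above can be illustrated with a minimal sketch: a total loss that sums a cross-modal contrastive (visual grounding) term over paired speech and image embeddings with a masked-prediction cross-entropy term over masked speech positions. All function names, the InfoNCE-style formulation of the grounding loss, and the scalar weighting are assumptions for illustration; the paper's actual losses and hyperparameters may differ.

```python
import numpy as np

def contrastive_vg_loss(speech_emb, image_emb, temperature=0.07):
    """InfoNCE-style loss matching each speech clip to its paired image.

    Assumed formulation: diagonal entries of the similarity matrix are
    the positive speech-image pairs; all other images in the batch are
    negatives.
    """
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = s @ v.T / temperature
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

def masked_lm_loss(logits, targets, mask):
    """Cross-entropy over masked positions only (mask is 1 where masked)."""
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]
    return -np.sum(picked * mask) / np.maximum(mask.sum(), 1.0)

def multitask_loss(speech_emb, image_emb, mlm_logits, mlm_targets, mlm_mask,
                   mlm_weight=1.0):
    """FaST-VGS+-style total objective: grounding loss + weighted MLM loss.

    mlm_weight is a hypothetical balancing coefficient.
    """
    return (contrastive_vg_loss(speech_emb, image_emb)
            + mlm_weight * masked_lm_loss(mlm_logits, mlm_targets, mlm_mask))
```

In this sketch the two objectives share the speech encoder implicitly (both consume its outputs), which is the sense in which the model is "learned in a multi-task fashion": one backward pass updates the encoder with gradients from both terms.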

