Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

11/21/2019
by David Harwath, et al.

In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our method can capture both word-level and sub-word units, depending on how it is configured. What differentiates this paper from prior work on speech unit learning is the choice of training objective. Rather than using a reconstruction-based loss, we use a discriminative, multimodal grounding objective that forces the learned units to be useful for semantic image retrieval. We evaluate the sub-word units on the ZeroSpeech 2019 challenge, achieving a 27.3% reduction in ABX error rate over the top-performing submission while keeping the bitrate approximately the same. We also present experiments demonstrating the noise robustness of these units. Finally, we show that a model with multiple quantizers can simultaneously learn phone-like detectors at a lower layer and word-like detectors at a higher layer. We show that these detectors are highly accurate, discovering 279 words with an F1 score greater than 0.5.
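As a rough illustration of what a vector quantization layer does at inference time, the sketch below maps each frame-level feature vector to the index of its nearest codebook entry, so that a continuous sequence becomes a sequence of discrete unit IDs. This is not the authors' implementation: the function names, the toy 2-dimensional codebook, and the use of squared Euclidean distance are all assumptions made for illustration, and the sketch omits how the codebook is learned during training.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Replace each frame with its nearest codebook vector.

    Returns the chosen codebook indices (the discrete units) and the
    corresponding quantized vectors. Hypothetical helper, not from the paper.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for frame in frames:
        # Pick the codebook entry with the smallest distance to this frame.
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]),
        )
        indices.append(best)
        quantized.append(list(codebook[best]))
    return indices, quantized


# Toy example: three codebook entries and four 2-dimensional "frames".
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.4], [0.2, 0.1]]

indices, quantized = quantize(frames, codebook)
print(indices)  # -> [0, 1, 2, 0]
```

In a trained model the frames would be intermediate activations of the speech network and the codebook would be learned jointly with the rest of the parameters; the index sequence is what gets interpreted as phone-like or word-like units.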

06/15/2020

Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech

We investigate the effect of introducing phone, syllable, or word bounda...
02/21/2019

Towards Visually Grounded Sub-Word Speech Unit Discovery

In this paper, we investigate the manner in which interpretable sub-word...
09/09/2019

Language learning using Speech to Image retrieval

Humans learn language by interaction with their environment and listenin...
01/25/2017

Learning Word-Like Units from Joint Audio-Visual Analysis

Given a collection of images and spoken audio captions, we present a met...
03/14/2022

Modelling word learning and recognition using visually grounded speech

Background: Computational models of speech recognition often assume that...
02/08/2019

Models of Visually Grounded Speech Signal Pay Attention To Nouns: a Bilingual Experiment on English and Japanese

We investigate the behaviour of attention in neural models of visually g...
08/12/2021

Mispronunciation Detection and Correction via Discrete Acoustic Units

Computer-Assisted Pronunciation Training (CAPT) plays an important role ...