Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

by   David Harwath, et al.

In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our method is capable of capturing both word-level and sub-word units, depending on how it is configured. What differentiates this paper from prior work on speech unit learning is the choice of training objective. Rather than using a reconstruction-based loss, we use a discriminative, multimodal grounding objective which forces the learned units to be useful for semantic image retrieval. We evaluate the sub-word units on the ZeroSpeech 2019 challenge, achieving a 27.3% reduction in ABX error rate over the top-performing submission, while keeping the bitrate approximately the same. We also present experiments demonstrating the noise robustness of these units. Finally, we show that a model with multiple quantizers can simultaneously learn phone-like detectors at a lower layer and word-like detectors at a higher layer. We show that these detectors are highly accurate, discovering 279 words with an F1 score of greater than 0.5.


Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech

We investigate the effect of introducing phone, syllable, or word bounda...

Towards Visually Grounded Sub-Word Speech Unit Discovery

In this paper, we investigate the manner in which interpretable sub-word...

Language learning using Speech to Image retrieval

Humans learn language by interaction with their environment and listenin...

Learning Word-Like Units from Joint Audio-Visual Analysis

Given a collection of images and spoken audio captions, we present a met...

Modelling word learning and recognition using visually grounded speech

Background: Computational models of speech recognition often assume that...

Models of Visually Grounded Speech Signal Pay Attention To Nouns: a Bilingual Experiment on English and Japanese

We investigate the behaviour of attention in neural models of visually g...

Mispronunciation Detection and Correction via Discrete Acoustic Units

Computer-Assisted Pronunciation Training (CAPT) plays an important role ...