Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

11/21/2019
by   David Harwath, et al.
0

In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our method is capable of capturing both word-level and sub-word units, depending on how it is configured. What differentiates this paper from prior work on speech unit learning is the choice of training objective. Rather than using a reconstruction-based loss, we use a discriminative, multimodal grounding objective which forces the learned units to be useful for semantic image retrieval. We evaluate the sub-word units on the ZeroSpeech 2019 challenge, achieving a 27.3% reduction in ABX error rate over the top-performing submission, while keeping the bitrate approximately the same. We also present experiments demonstrating the noise robustness of these units. Finally, we show that a model with multiple quantizers can simultaneously learn phone-like detectors at a lower layer and word-like detectors at a higher layer. We show that these detectors are highly accurate, discovering 279 words with an F1 score of greater than 0.5.

READ FULL TEXT
research
06/15/2020

Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech

We investigate the effect of introducing phone, syllable, or word bounda...
research
06/30/2023

What do self-supervised speech models know about words?

Many self-supervised speech models (S3Ms) have been introduced over the ...
research
02/21/2019

Towards Visually Grounded Sub-Word Speech Unit Discovery

In this paper, we investigate the manner in which interpretable sub-word...
research
09/09/2019

Language learning using Speech to Image retrieval

Humans learn language by interaction with their environment and listenin...
research
09/18/2019

Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech

In this paper, we study how word-like units are represented and activate...
research
02/08/2019

Models of Visually Grounded Speech Signal Pay Attention To Nouns: a Bilingual Experiment on English and Japanese

We investigate the behaviour of attention in neural models of visually g...
research
11/20/2018

WEST: Word Encoded Sequence Transducers

Most of the parameters in large vocabulary models are used in embedding ...

Please sign up or login with your details

Forgot password? Click here to reset