Towards Visually Grounded Sub-Word Speech Unit Discovery

02/21/2019
by David Harwath, et al.

In this paper, we investigate the manner in which interpretable sub-word speech units emerge within a convolutional neural network model trained to associate raw speech waveforms with semantically related natural image scenes. We show how diphone boundaries can be superficially extracted from the activation patterns of intermediate layers of the model, suggesting that the model may be leveraging these events for the purpose of word recognition. We present a series of experiments investigating the information encoded by these events.
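To make the boundary-extraction idea concrete, here is a minimal illustrative sketch, not the authors' implementation: it assumes a PyTorch-style 1-D convolutional audio encoder and simply picks peaks in the frame-to-frame change of an intermediate layer's activations as candidate sub-word (diphone-like) boundaries. The model class AudioCNN, the layer name conv2, and the threshold value are all hypothetical stand-ins.

```python
# Illustrative sketch only: peak-picking on intermediate CNN activations as
# candidate sub-word boundaries. "AudioCNN", "conv2", and all parameter
# values are hypothetical, not the paper's actual model or method.
import torch
import torch.nn as nn


class AudioCNN(nn.Module):
    """A toy 1-D convolutional speech encoder (placeholder architecture)."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv1d(40, 128, kernel_size=5, stride=2, padding=2)
        self.conv2 = nn.Conv1d(128, 256, kernel_size=5, stride=2, padding=2)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        return x


def boundary_candidates(model, features, threshold=0.5):
    """Return frame indices where an intermediate layer's activation pattern
    changes sharply, treated here as candidate segment boundaries."""
    acts = {}

    def hook(_module, _inp, out):
        acts["conv2"] = out.detach()

    handle = model.conv2.register_forward_hook(hook)
    with torch.no_grad():
        model(features)
    handle.remove()

    a = acts["conv2"].squeeze(0)                  # (channels, frames)
    delta = (a[:, 1:] - a[:, :-1]).norm(dim=0)    # per-frame change magnitude
    delta = delta / (delta.max() + 1e-8)          # normalize to [0, 1]
    # Keep local maxima above the threshold as boundary candidates.
    peaks = (delta[1:-1] > delta[:-2]) & (delta[1:-1] > delta[2:]) & (delta[1:-1] > threshold)
    return (peaks.nonzero(as_tuple=True)[0] + 1).tolist()


# Usage: 40-dim filterbank features for a short utterance (100 frames).
model = AudioCNN().eval()
fbank = torch.randn(1, 40, 100)
print(boundary_candidates(model, fbank))
```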

Related research

Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech (09/18/2019)
In this paper, we study how word-like units are represented and activate...

Learning to Recognise Words using Visually Grounded Speech (05/31/2020)
We investigated word recognition in a Visually Grounded Speech model. Th...

A Neural Network Model of Lexical Competition during Infant Spoken Word Recognition (06/01/2020)
Visual world studies show that upon hearing a word in a target-absent vi...

Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech (11/21/2019)
In this paper, we present a method for learning discrete linguistic unit...

Interpreting intermediate convolutional layers of CNNs trained on raw speech (04/19/2021)
This paper presents a technique to interpret and visualize intermediate ...

Investigating the Utility of Surprisal from Large Language Models for Speech Synthesis Prosody (06/16/2023)
This paper investigates the use of word surprisal, a measure of the pred...

WEST: Word Encoded Sequence Transducers (11/20/2018)
Most of the parameters in large vocabulary models are used in embedding ...
