Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech

06/15/2020
by William N. Havard, et al.

We investigate the effect of introducing phone, syllable, or word boundaries on the performance of a Model of Visually Grounded Speech, and compare the results with a model that uses no boundary information and with a model that uses random boundaries. We propose a simple way to introduce such information in an RNN-based model and investigate which type of boundary enables a better mapping between an image and its spoken description. We also explore where, that is, at which level of the network's architecture, such information should be introduced. We show that a segmentation that yields syllable-like or word-like segments and that respects word boundaries is the most efficient. We also show that linguistically informed subsampling is more efficient than random subsampling. Finally, we show that a hierarchical segmentation, which first applies a phone segmentation and then recomposes words from the phone units, yields better results than either a phone or a word segmentation in isolation.
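
The abstract does not spell out how boundaries are injected, so the following is only a minimal sketch of one plausible reading of "linguistically informed subsampling" versus "random subsampling": keep the per-frame states that fall on segment boundaries, and compare against keeping the same number of randomly chosen states. The function names, the toy frame labels, and the boundary indices are all hypothetical and are not taken from the paper.

```python
import random


def subsample_at_boundaries(frames, boundaries):
    """Keep only the frames whose index appears in `boundaries` (in time order)."""
    valid = sorted({i for i in boundaries if 0 <= i < len(frames)})
    return [frames[i] for i in valid]


def random_boundaries(num_frames, count, seed=0):
    """Draw `count` distinct frame indices uniformly at random (random baseline)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_frames), count))


# Toy sequence: one placeholder "state" per speech frame.
frames = [f"h{i}" for i in range(10)]

# Hypothetical word-final frame indices for this utterance.
word_ends = [2, 5, 9]

# Linguistically informed subsampling: one state per word.
informed = subsample_at_boundaries(frames, word_ends)

# Random subsampling with the same number of retained states.
baseline = subsample_at_boundaries(
    frames, random_boundaries(len(frames), len(word_ends))
)

print(informed)
print(baseline)
```

Matching the number of retained states in both conditions keeps the comparison about where the boundaries fall rather than how many there are.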

research
10/06/2022

Are word boundaries useful for unsupervised language learning?

Word or word-fragment based Language Models (LM) are typically preferred...
research
11/21/2019

Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

In this paper, we present a method for learning discrete linguistic unit...
research
06/22/2022

DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon

Finding word boundaries in continuous speech is challenging as there is ...
research
08/23/2019

VOP Detection for Read and Conversation Speech using CWT Coefficients and Phone Boundaries

In this paper, we propose a novel approach for accurate detection of the...
research
12/14/2020

Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks

We investigate segmenting and clustering speech into low-bitrate phone-l...
research
04/29/2020

Robust Phonetic Segmentation Using Spectral Transition measure for Non-Standard Recording Environments

Phone level localization of mis-articulation is a key requirement for an...
research
05/26/2021

Prosodic segmentation for parsing spoken dialogue

Parsing spoken dialogue poses unique difficulties, including disfluencie...
