What do self-supervised speech models know about words?

06/30/2023
by Ankita Pasad et al.

Many self-supervised speech models (S3Ms) have been introduced over the last few years, producing performance and data efficiency improvements for a variety of speech tasks. Evidence is emerging that different S3Ms encode linguistic information in different layers, and also that some S3Ms appear to learn phone-like sub-word units. However, the extent to which these models capture larger linguistic units, such as words, and where word-related information is encoded, remains unclear. In this study, we conduct several analyses of word segment representations extracted from different layers of three S3Ms: wav2vec2, HuBERT, and WavLM. We employ canonical correlation analysis (CCA), a lightweight analysis tool, to measure the similarity between these representations and word-level linguistic properties. We find that the maximal word-level linguistic content tends to be found in intermediate model layers, while some lower-level information, like pronunciation, is also retained in the higher layers of HuBERT and WavLM. Syntactic and semantic word attributes show similar layer-wise behavior. We also find that, for all of the models tested, word identity information is concentrated near the center of each word segment. We then test the layer-wise performance of the same models, used directly with no additional learned parameters, on several tasks: acoustic word discrimination, word segmentation, and semantic sentence similarity. We find similar layer-wise trends in performance and, furthermore, find that the best-performing layer of HuBERT or WavLM achieves word segmentation and sentence similarity performance that rivals more complex existing approaches.
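
To make the CCA analysis concrete, below is a minimal sketch of how one might pool word-segment representations from each layer of a HuBERT-style model and score them against word-level property vectors. The checkpoint name, mean-pooling choice, and scikit-learn's vanilla CCA are illustrative assumptions, not details taken from the paper; the authors' exact pooling scheme and CCA variant may differ.

```python
# Minimal sketch of a layer-wise CCA analysis of word segments.
# Assumptions not taken from the paper: the HuggingFace checkpoint,
# mean-pooling over word spans, and scikit-learn's vanilla CCA in
# place of whatever CCA variant the authors actually used.
import numpy as np
import torch
from sklearn.cross_decomposition import CCA
from transformers import HubertModel, Wav2Vec2FeatureExtractor

CKPT = "facebook/hubert-base-ls960"
model = HubertModel.from_pretrained(CKPT, output_hidden_states=True).eval()
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
FRAME_RATE = 50  # HuBERT emits one frame per 20 ms of 16 kHz audio

def word_segment_reps(waveform, word_spans):
    """Mean-pool each layer's frames over (start_sec, end_sec) word spans."""
    inputs = extractor(waveform, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states  # one (1, T, D) tensor per layer
    per_layer = []
    for layer in hidden:
        pooled = []
        for start, end in word_spans:
            s, e = int(start * FRAME_RATE), int(end * FRAME_RATE)
            e = max(e, s + 1)  # guard against sub-frame segments
            pooled.append(layer[0, s:e].mean(dim=0))
        per_layer.append(torch.stack(pooled).numpy())
    return per_layer  # list of (n_words, D) arrays, one per layer

def cca_similarity(X, Y, k=20):
    """Mean canonical correlation between representations X and properties Y."""
    k = min(k, X.shape[1], Y.shape[1])
    Xc, Yc = CCA(n_components=k, max_iter=2000).fit_transform(X, Y)
    return float(np.mean([np.corrcoef(Xc[:, i], Yc[:, i])[0, 1]
                          for i in range(k)]))

# Usage: Y holds one row of word-level properties per segment, e.g. a
# (hypothetical) matrix of pretrained word embeddings for semantic content.
# reps = word_segment_reps(waveform, word_spans)
# scores = [cca_similarity(X, Y) for X in reps]  # peak marks the best layer
```

The same pooled per-layer representations could also be reused for the parameter-free downstream tasks the abstract mentions, for instance scoring acoustic word discrimination by cosine distance between pooled segments.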

Related research

11/08/2022 · Comparative layer-wise analysis of self-supervised speech models
Many self-supervised speech models, varying in their pre-training object...

06/09/2023 · Probing self-supervised speech models for phonetic and phonemic information: a case study in aspiration
Textless self-supervised speech models have grown in capabilities in rec...

11/21/2019 · Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
In this paper, we present a method for learning discrete linguistic unit...

05/03/2020 · Similarity Analysis of Contextual Word Representation Models
This paper investigates contextual word representation models from the l...

09/29/2021 · Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? – A computational investigation
Decades of research has studied how language learning infants learn to d...

12/14/2020 · Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks
We investigate segmenting and clustering speech into low-bitrate phone-l...

06/24/2019 · A computational model of early language acquisition from audiovisual experiences of young infants
Earlier research has suggested that human infants might use statistical ...
