On the representation of speech and music

05/08/2019
by David N. Levin, et al.

In most automatic speech recognition (ASR) systems, the audio signal is processed to produce a time series of sensor measurements (e.g., filterbank outputs). This time series encodes semantic information in a speaker-dependent way. An earlier paper showed how to use the sequence of sensor measurements to derive an "inner" time series that is unaffected by any previous invertible transformation of the sensor measurements. The current paper considers two or more speakers, who mimic one another in the following sense: when they say the same words, they produce sensor states that are invertibly mapped onto one another. It follows that the inner time series of their utterances must be the same when they say the same words. In other words, the inner time series encodes their speech in a manner that is speaker-independent. Consequently, the ASR training process can be simplified by collecting and labelling the inner time series of the utterances of just one speaker, instead of training on the sensor time series of the utterances of a large variety of speakers. A similar argument suggests that the inner time series of music is instrument-independent. This is demonstrated in experiments on monophonic electronic music.
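The invariance claim can be illustrated with a toy one-dimensional analogue. The sketch below is not the paper's "inner time series" construction; it only demonstrates the same principle, namely that a representation built solely from the ordering of sensor states is unchanged by any strictly increasing (hence invertible) remapping of those states.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Sensor" time series of speaker A (e.g., one filterbank output over time).
x = rng.normal(size=50)

# Speaker B mimics A: B's sensor states are an invertible remapping of A's.
# Here the remapping is strictly increasing, so it is invertible on the reals.
y = np.tanh(2.0 * x) + 0.1 * x

def inner(s):
    """A toy 'inner' representation: the rank order of the samples.

    Ranks depend only on the ordering of sensor values, so they are
    invariant under any strictly increasing transformation.
    """
    return np.argsort(np.argsort(s))

# Both "speakers" yield the same inner representation.
assert np.array_equal(inner(x), inner(y))
```

Note that ranks are only invariant under monotone one-dimensional maps; the paper's construction handles arbitrary invertible maps of multidimensional sensor states, which is why this is a sketch of the principle rather than of the method itself.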

Related research

06/24/2023 · An Analysis of Personalized Speech Recognition System Development for the Deaf and Hard-of-Hearing
Deaf or hard-of-hearing (DHH) speakers typically have atypical speech ca...

06/26/2019 · Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition
In this paper, we propose a novel auxiliary loss function for target-spe...

10/08/2020 · interface : Electronic Chamber Ensemble
This paper presents the interface developments and music of the duo "int...

05/05/2022 · Speaker Recognition in the Wild
In this paper, we propose a pipeline to find the number of speakers, as ...

04/12/2023 · Acoustic absement in detail: Quantifying acoustic differences across time-series representations of speech data
The speech signal is a consummate example of time-series data. The acous...

07/19/2004 · Channel-Independent and Sensor-Independent Stimulus Representations
This paper shows how a machine, which observes stimuli through an unchar...
