There is no accurate computational model of human speech perception that applies to real speech. Implemented speech perception models exist which take artificial phonetic or perceptual features as input and map them to recognized words McClelland and Elman (1986); Norris and McQueen (2008), use speech recognizers as a front-end to derive phonetic transcriptions Scharenborg et al. (2005), or work on raw speech waveforms for extremely artificial utterances only Elman and McClelland (2015). Yet, traditional automatic speech recognition systems directly analyze natural, recorded, continuous speech and decode it as a sequence of phonemes or words. We take the reverse engineering approach Dupoux (2018) of concluding that the signal processing and machine learning tools underlying automatic speech recognition should thus provide a starting point for a model of human speech perception.
Little is known, however, about the exact nature of the difference between the behaviour of human beings and that of speech processing tools developed for an applied purpose. We propose the Perceptimatic English Benchmark (PEB), an experimental human data set documenting a basic profile of English phone discrimination which is amenable to comparisons with a wide range of models. [Footnote 1: All stimuli, human experimental data, analysis and processing scripts, and model results, are available at the following permanent link: [MASKED FOR REVIEW].]
We focus on a simple experiment for which typical speech recognition models could in principle give results comparable to humans, that of phone discrimination (typical speech recognition models are classifiers for sequences of phones). However, speech recognition models are trained on databases of continuous, natural speech, while typical experimental stimuli are individual phones, syllables, or words, read or synthesized in an effort to ensure that the phonetic properties being probed are audible. Such word-list type pronunciations, while clear to human listeners, are likely quite different from the training data of standard speech recognition models. Models applied to them would be faced with the often difficult task of generalizing to a novel speech style. We first propose a more conservative test: the Perceptimatic English Benchmark is thus constructed out of snippets from a corpus of read speech—ecological for typical models—tested as phone discrimination experiment items on English listeners.
To the degree made possible by the speech corpora from which the stimuli are extracted, we make the evaluation complete, in the sense that it tests discrimination of as many pairs of phones as possible, while being controlled in several ways, notably in never comparing phones extracted from radically different phonetic contexts. To widen the benchmark, we also test French stimuli, which are unfamiliar both to human listeners and to models trained on English. Details are found in Section 2 (Perceptimatic English Benchmark) below.
In this paper, we use the PEB to evaluate seven models that apply to real speech. We find that several models are predictive of humans. Surprisingly, a multilingual model—which is not trained to recognize English phonemes—and a short-duration acoustic event model—which is not trained to recognize phonemes at all—are far more predictive than a standard speech recognizer. We argue that the speech recognizer is too good at English phone discrimination.
2 Perceptimatic English Benchmark
2.0.1 Experimental Task
We assess the perception of phones. We use an ABX discrimination task. Human participants hear three stimuli and are asked to identify which one of the first two stimuli (A or B) is more similar to the third (X). The experimenter always identifies a correct answer—in this case, by making A and B instances of two different phonetic categories, and X another example of one of the two. The accuracy of listeners’ responses to a given triplet (combination of specific stimuli into an A–B–X item) gives a measure of the discriminability of the categories to which A and B belong.
We construct triplets in which A, B, and X are each sequences of three consecutive phones extracted from running speech, where only the centre phone differs between A and B (for example, [seIk]–[soUk], [zfA]–[zpA]). We control the context in order to avoid mismatching different contextual allophones. We incorporate this context into the stimuli in order to avoid making the stimuli too short. The references, A and B, are uttered by the same speaker in order to avoid listeners’ responding on the basis of speaker differences, while the probe, X, is uttered by a different speaker, to encourage listeners to focus on the linguistic content rather than acoustic detail. The delay between A and B is 500 milliseconds, and between B and X 650 milliseconds, as pilot subjects reported having difficulty recalling the reference stimuli when the delays were exactly equal.
Both English and French stimuli are extracted from the subset of the LibriVox audio book collection used as evaluation stimuli in the Zero Resource Speech Challenge (see Section 7, Related work, below). The set of centre phones used is drawn from the phonemic transcriptions. We exclude phones (or phones in certain neutralizing contexts) which we expected might be subject to a merger, or which were sufficiently marginal that the transcriptions were unlikely to be reliable. In total, the stimuli consist of 5202 triplets (2214 from English), making 461 distinct centre phone contrasts (212 English, 249 French), in a total of 201 distinct contexts (118 English, 83 French), with most phone comparisons appearing in three contexts each (a total of 47 English contrasts appear in either one, two, or four contexts). The speakers used (15 English, 18 French) have, in our assessment, pronunciations close to standard American English/Metropolitan French. Not all phone comparisons occur, nor do all phone comparisons occur in the same contexts, or with the same set of speakers: we (native English and French listeners) selected the stimuli by hand out of the very large set of constructible triplets to maximize the phonetic similarity of the probe’s centre phone to that of the correct answer, and minimize phonetic differences in the surrounding contexts. This is critical when extracting stimuli from natural speech: transcriptions are not always accurate, and a three-phone window is not sufficient to guarantee which of the many possible contextual variants each transcribed phone really corresponds to. [Footnote 2: The full set of English centre phones included in at least one item is [æ A b d D eI E f g h i I k l m n N oU p r s S t tS u U v w z]. The full set of French phones included is [a Ã b d e E Ẽ f g i j k l m n o ø O Õ p K s S t u v w y z Z]. For the full list of pairs and contexts tested, see the online repository.]
Each set of three stimuli appears in four distinct items, corresponding to orders AB–A (that is, X is another instance of the three-phone sequence A), BA–B, AB–B, and BA–A.
2.0.3 Reference Data Collection
The data set includes 91 participants reporting English as the language to which they were primarily exposed before the age of eight. They performed the task on Amazon Mechanical Turk (US participants) with the LMEDS software Mahrt (2016) and were paid for participation. [Footnote 3: Kleinschmidt and Jaeger (2015) made a detailed comparison of data from an in-lab speech perception experiment with a Mechanical Turk replication and found a close correspondence between the results.] We asked participants to use headphones, to do the task in a quiet environment, and to check the sound volume before the experiment began. 15 additional participants were tested but did not meet the language background requirements, and 65 were rejected for failing at least three out of twelve catch trials or not finishing the task. [Footnote 4: The catch trials consisted of additional, highly distinct three-phone ABX stimuli, including several which required participants to distinguish cat from dog.]
For testing, items were counterbalanced into lists of 190 triplets per participant, such that no participant was tested twice on the same contrast, and such that the combination of speakers was not predictive of the right answer. Each stimulus was tested three times, so that most contrasts were tested at least 36 times. Participants responded as to which of the two reference stimuli the probe corresponded to on a six-point scale, ranging from “first stimuli for sure” to “second stimuli for sure”, with two intermediate degrees of certainty for each reference stimulus. The benchmark includes both these responses and a binarized version, taking into account the participant’s choice but not their reported certainty. Here we report only analysis of the binarized responses to avoid questions about how to model participants’ use of the scale (preliminary analyses on the scaled responses indicate that the results are qualitatively the same).
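The binarization step can be sketched as follows. This is only an illustration: it assumes the six-point scale is coded as integers 1 to 6, with 1 standing for “first stimuli for sure” and 6 for “second stimuli for sure”. That coding, and the function names, are hypothetical and may differ from the released data.

```python
def binarize(response: int) -> str:
    """Map a six-point response (assumed coding 1..6) to the chosen reference.

    Responses 1-3 are taken to favour the first reference stimulus and
    responses 4-6 the second; the reported certainty is discarded.
    """
    if not 1 <= response <= 6:
        raise ValueError("response must be an integer between 1 and 6")
    return "first" if response <= 3 else "second"


def is_correct(response: int, correct_reference: str) -> int:
    """Return 1 if the binarized choice matches the correct reference, else 0."""
    return int(binarize(response) == correct_reference)
```

For example, under this coding a response of 5 on a trial whose correct answer is the second reference stimulus counts as correct, whatever the degree of certainty.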
3 Generating Model Predictions
For each experimental stimulus, we suppose that we can apply a model to the audio file and extract that model’s representation of the stimulus (see below for examples). To predict human responses, we compute distances $d(T, X)$, between the probe $X$ and the correct matching stimulus $T$, and $d(O, X)$, between the probe and the other reference stimulus $O$, to generate a degree of correct discriminability $\delta = d(O, X) - d(T, X)$. If $\delta > 0$, then the model treats the probe as being more similar to the correct than the incorrect answer. Our goal is to assess whether humans’ perceived similarity matches the model’s distances. Humans’ responses are stochastic, and need not use a threshold at the point of maximal perceived similarity. This leads us to use a binomial generalized linear model with an intercept parameter.
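A minimal sketch of this linking hypothesis, under two assumptions that should be checked against the released analysis scripts: that the degree of correct discriminability is the distance from the probe to the other reference minus the distance from the probe to the correct reference, and that the binomial model uses a probit link (as in the regression analysis reported later in the paper). The function names and coefficient values are hypothetical.

```python
import math


def discriminability(d_target: float, d_other: float) -> float:
    """Positive when the probe is closer to the correct reference (assumed form)."""
    return d_other - d_target


def probit_p_correct(delta: float, intercept: float, slope: float) -> float:
    """Predicted probability of a correct response under a probit link.

    The intercept lets the response threshold sit somewhere other than
    delta = 0; intercept and slope would be estimated from the human data.
    """
    z = intercept + slope * delta
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF


# Placeholder distances and coefficients, for illustration only.
delta = discriminability(d_target=0.2, d_other=0.5)
p = probit_p_correct(delta, intercept=0.3, slope=2.0)
```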
This is not the only possible linking hypothesis, but it is broadly applicable, and allows for a distance function to be selected that is appropriate to the type of representation being tested. All the models we consider in this paper yield representations of variable length (they output vector sequences, one vector per time frame, and the stimuli are not all of the same duration). Thus, we use distance functions based on dynamic time warping. Dynamic time warping takes two sequences $S_1$ and $S_2$ as input, as well as a function $\gamma$ for comparing pairs of sequence elements. It aligns $S_1$ and $S_2$ by matching the elements of one to the other so as to minimize the sum of $\gamma(s_1, s_2)$ for all matched elements $\langle s_1, s_2 \rangle$. Each element of $S_1$ must be matched with at least one element of $S_2$, and alignments must respect temporal order. Here we calculate the distance between stimuli $S_1$ and $S_2$ from the values of $\gamma$ along this optimal alignment (1).
For the models analyzed here, we take $\gamma$ to be either a symmetrised Kullback–Leibler divergence (for models that output probability vectors), or a cosine distance. [Footnote 5: We replace zero elements with a very small constant to avoid division by zero.] For $n$-dimensional vectors $x$ and $y$, the symmetrised Kullback–Leibler divergence is given by (2) and the cosine distance by (3).
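A minimal pure-Python sketch of a DTW-based distance of the kind described above. Several choices here are illustrative assumptions rather than the paper's confirmed definitions: the stimulus distance is taken to be the mean of the frame-level comparison along the minimum-cost alignment, the cosine distance is taken to be one minus the cosine similarity, and the symmetrised Kullback–Leibler divergence is taken to be the sum of the two directed divergences. All function names are hypothetical.

```python
import math


def cosine_distance(x, y):
    """One minus the cosine similarity (assumed form of the cosine distance)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def symmetric_kl(x, y, eps=1e-10):
    """KL(x||y) + KL(y||x) (assumed form); zeros are replaced by a small constant."""
    x = [a if a > 0 else eps for a in x]
    y = [b if b > 0 else eps for b in y]
    return sum(a * math.log(a / b) + b * math.log(b / a) for a, b in zip(x, y))


def dtw_distance(s1, s2, gamma):
    """Mean of gamma along the alignment that minimizes the summed gamma.

    Every frame of s1 is matched to at least one frame of s2 and vice versa,
    and the alignment is monotonic in time.
    """
    n, m = len(s1), len(s2)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    cost = [[0.0] * m for _ in range(n)]   # summed gamma of the best path to (i, j)
    steps = [[0] * m for _ in range(n)]    # number of matched pairs on that path
    for i in range(n):
        for j in range(m):
            g = gamma(s1[i], s2[j])
            if i == 0 and j == 0:
                cost[i][j], steps[i][j] = g, 1
                continue
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1][j], steps[i - 1][j]))
            if j > 0:
                candidates.append((cost[i][j - 1], steps[i][j - 1]))
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], steps[i - 1][j - 1]))
            best_cost, best_steps = min(candidates)
            cost[i][j] = best_cost + g
            steps[i][j] = best_steps + 1
    return cost[-1][-1] / steps[-1][-1]
```

Dividing by the path length keeps distances comparable across stimuli of different durations; if the summed cost is wanted instead, return `cost[-1][-1]` directly.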
We apply the methods described in the previous section. Unless otherwise stated, we take the comparison function $\gamma$ to be the cosine distance (3).
4.0.1 Dirichlet Process Gaussian Mixture Model
We evaluate a Dirichlet process Gaussian mixture model (DPGMM) as proposed by chen2015parallel. Given a training set of speech recordings in a language, the model performs non-parametric Bayesian clustering on the entire database, treated as an unordered collection of instantaneous acoustic feature vectors (see Section 4.0.5, Mel Filterbank Cepstral Coefficients, below). It models short-duration acoustic events. A fitted model consists of an optimal set of Gaussian distributions, typically several hundred. The model thus preserves fine-grained temporal and acoustic detail, while still modelling a specific language. It does not use phoneme labels. Passing over a new sample at a fixed analysis rate (in our case, analyzing 25 milliseconds of signal every 10 milliseconds), each instant of signal is mapped to a vector of posterior probabilities over the Gaussians in the model. We take $\gamma$ to be the symmetrized KL divergence (2). We apply the English model described in millet2019comparing, trained on 34 hours of English speech taken from the LibriVox dataset (no overlap with the stimuli or speakers in Perceptimatic).
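The mapping from one frame of acoustic features to a posterior vector over the fitted Gaussians can be sketched as below. This is an illustration only: it assumes diagonal covariances and takes the mixture weights, means, and variances as given, whereas the model described above infers the number of components and their parameters by Bayesian clustering. The function names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian (assumed covariance form)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (a - m) ** 2 / v
        for a, m, v in zip(x, mean, var)
    )


def posteriors(x, weights, means, variances):
    """Posterior probability of each Gaussian component given one frame x."""
    log_joint = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(log_joint)                        # subtract the max for stability
    unnorm = [math.exp(lj - top) for lj in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

Applying `posteriors` to every frame of a stimulus yields the sequence of probability vectors that a divergence between probability distributions can then compare.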
4.0.2 Bottleneck features
We evaluate three models proposed in bottleneck. These bottleneck models are trained to label speech with phone states. Phone states are temporal analysis units used by certain speech recognizers: each phone of the language is modelled as having (in the typical three-state model) a beginning, middle, and end state, each with different acoustic properties. The bottleneck models are trained on speech data annotated with an attribution to phone states. They are neural networks trained to predict the phone state associated with a given instant of speech, on the basis of its acoustic features, accompanied by 310 ms of surrounding context. These models are thus optimized to predict a slightly more temporally fine-grained version of standard phoneme labels. “Bottleneck” refers to a hidden layer that has significantly lower dimension than the other layers. The features we use are the contents of this layer, for each instant of signal. We evaluate English monophone, English triphone, and multilingual models. [Footnote 6: Referred to by bottleneck as FisherMono, FisherTri, and BabelMulti.] The English monophone model is optimized to predict states for English phonemes. The English triphone model is optimized to predict states for contextual allophones. The multilingual model is trained on data from seventeen phonetically diverse languages (not including English), optimized to label phoneme states in any of these languages (if the same sound belongs to different inventories, it is treated as distinct, for a total of 1032 possible phonemes).
4.0.3 DeepSpeech
DeepSpeech Hannun et al. (2014) is a neural automatic speech recognition model used in the Mozilla speech tools. The model uses bi-directional recurrent units, which integrate information both forwards and backwards in time, to predict text transcriptions (sequences of letters, not phones) from speech. We can examine the state of any of its several internal layers corresponding to any instant of signal. After scoring each layer on its performance on the (artificial) phone discrimination evaluation described in Dunbar et al. (2017), we found that layer five was optimal. We thus analyze the outputs from that layer. The model has a training objective related to that of the English bottleneck models (predicting text), but the recurrent units allow it to model long distance temporal dependencies. We use Mozilla DeepSpeech 0.4.1 [Footnote 7: https://github.com/mozilla/DeepSpeech/releases/tag/v0.4.1], which is trained on the Fisher Cieri et al. (2004) and Switchboard Godfrey et al. (1992) telephone corpora and the LibriSpeech audio book corpus Panayotov et al. (2015). The model achieves an 8.26% word error rate on the LibriSpeech clean test evaluation.
4.0.4 Articulatory Reconstruction
To explore whether similarities at the level of articulation are more predictive of humans’ behaviour, we evaluate a neural articulatory reconstruction model Parrot et al. (2019), trained to reconstruct continuous electromagnetic articulography (EMA) coil position trajectories from speech recordings (tongue body, tongue tip, tongue dorsum, upper lip, lower lip, lower incisor). The model is trained on the EMA-IEEE corpus Tiede et al. (2017), approximately six hours of read English speech, paired with EMA recordings, from eight speakers.
4.0.5 Mel Filterbank Cepstral Coefficients
We use Kaldi Povey et al. (2011) to extract 13 Mel filterbank cepstral coefficients (MFCC): one vector every 10 milliseconds, each analyzing 25 milliseconds of signal. These audio representations, used standardly as input to speech recognition, are the result of a low-resolution spectral analysis and a discrete cosine transformation. We add the first and second derivatives, for a total of 39 dimensions, and apply mean-variance normalization over a moving three-second window. This approach, like the multilingual bottleneck features, does not specifically model English; unlike that model, this is a fixed transformation, not tuned to any language, or indeed to speech at all.
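The moving-window normalization can be illustrated with the sketch below. It is not Kaldi's implementation: it assumes a window of 300 frames (three seconds at one frame every 10 milliseconds) centred on the current frame and shifted inward at the edges of the recording, and the function name is hypothetical.

```python
import math


def sliding_mean_variance_normalize(frames, window=300, eps=1e-8):
    """Normalize each feature dimension using statistics from a moving window.

    frames: list of equal-length feature vectors, one per 10 ms frame.
    window: number of frames in the window (300 frames = 3 s, an assumption).
    """
    n = len(frames)
    if n == 0:
        return []
    dim = len(frames[0])
    half = window // 2
    normalized = []
    for t in range(n):
        # Centre the window on frame t, shifting it inward at the edges.
        lo = max(0, t - half)
        hi = min(n, lo + window)
        lo = max(0, hi - window)
        chunk = frames[lo:hi]
        row = []
        for k in range(dim):
            values = [f[k] for f in chunk]
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            row.append((frames[t][k] - mean) / math.sqrt(var + eps))
        normalized.append(row)
    return normalized
```

For stimuli shorter than the window, as three-phone snippets are, the statistics in this sketch reduce to those of the whole input passed to the function.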
5.0.1 Performance on the Experimental Task
We compute the mean accuracies for each of the models, scoring stimuli as correct where the degree of correct discriminability is positive ($\delta > 0$). [Footnote 8: Scoring accuracy first by stimulus, then averaging by contrast, then overall.] The results in Table 1 indicate that the models’ performance is generally better than that of the human listeners in the PEB. This implies that, to the extent that any of these models accurately captures listeners’ perceived discriminability, listeners’ behaviour on the task, unsurprisingly, cannot correspond to a hard decision at the optimal decision threshold. The results also indicate, as expected, a small native language effect, that is, a decrease in listeners’ discrimination accuracy for the non-English stimuli. Such an effect is also captured by all the models trained on English. We observe that some models show native language effects numerically much larger than human listeners, a point we return to below.
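The scoring order (by stimulus, then by contrast, then overall) can be sketched as follows. The record layout, a tuple of contrast label, stimulus identifier, and a 0/1 correctness value, is a hypothetical format chosen for illustration and need not match the released data files.

```python
from collections import defaultdict


def mean_accuracy(trials):
    """Average accuracy by stimulus, then by contrast, then overall.

    trials: iterable of (contrast, stimulus_id, correct) with correct in {0, 1}.
    """
    by_stimulus = defaultdict(list)
    for contrast, stimulus_id, correct in trials:
        by_stimulus[(contrast, stimulus_id)].append(correct)

    by_contrast = defaultdict(list)
    for (contrast, _), values in by_stimulus.items():
        by_contrast[contrast].append(sum(values) / len(values))

    contrast_means = [sum(v) / len(v) for v in by_contrast.values()]
    return sum(contrast_means) / len(contrast_means)
```

Averaging in this order gives each contrast equal weight in the overall score, regardless of how many stimuli or responses it contributes.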
In order to see which model best predicts the human results, we fit probit regression models with a coefficient for the discriminability score corresponding to the given model. [Footnote 9: Here we report results on both the English (native) and French (non-native) stimuli. In the interest of analyzing stimuli that are maximally ecological for the models tested, we also analyzed the results of the native-language perception task only. The results are qualitatively identical, so we omit them in the interest of space. The table is available in the online repository.] The dependent variable is whether the trial response was correct (1: accurate, 0: inaccurate). To correct for effects that are not of interest, the models each also include a coefficient for whether the correct answer was A or B, a coefficient for the position of the trial in the experimental list, and a coefficient for participant.
We use differences in log likelihood for model comparison, obtaining confidence intervals by repeatedly drawing balanced subsamples: for each stimulus, we draw three observations without replacement. The results, in Table 2, show that the most predictive approaches are short-term acoustic event modelling (DPGMM) and bottleneck phone state predictors, with the English monophone (phoneme) predictor model showing non-significantly poorer performance than the allophonic and multilingual ones.
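A simplified sketch of the balanced subsampling procedure. It assumes that per-observation log likelihoods under two already-fitted regression models are available and only resamples which observations are scored; the actual analysis may instead refit the regressions on each subsample. The number of draws, the interval level, and the function names are hypothetical.

```python
import random


def subsample_loglik_differences(obs_by_stimulus, n_draws=1000, per_stimulus=3, seed=0):
    """Log-likelihood differences (model A minus model B) over balanced subsamples.

    obs_by_stimulus: {stimulus_id: [(loglik_a, loglik_b), ...]}; each stimulus
    needs at least per_stimulus observations, otherwise random.sample raises.
    """
    rng = random.Random(seed)
    differences = []
    for _ in range(n_draws):
        total = 0.0
        for observations in obs_by_stimulus.values():
            # Draw without replacement so every stimulus contributes equally.
            for loglik_a, loglik_b in rng.sample(observations, per_stimulus):
                total += loglik_a - loglik_b
        differences.append(total)
    return differences


def percentile_interval(values, level=0.95):
    """Simple percentile interval over the subsampled differences."""
    ordered = sorted(values)
    lo_index = int((1.0 - level) / 2.0 * (len(ordered) - 1))
    hi_index = int((1.0 + level) / 2.0 * (len(ordered) - 1))
    return ordered[lo_index], ordered[hi_index]
```

An interval that excludes zero would indicate that one model is reliably more predictive of the human responses than the other under this resampling scheme.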
The results of our English phone discrimination benchmark are best predicted by the DPGMM’s short-duration acoustic event modelling and the three bottleneck phone state classification models, consistent with millet2019comparing and nikamemoire. These do substantially better than generic audio features. Two of the bottleneck models are trained to predict English phoneme/allophone labels, but the multilingual model is not, which makes its performance all the more surprising. Neither is the DPGMM model, which, although trained on English, models 25 millisecond acoustic events into combinations of hundreds of detailed acoustic categories, and is thus much more fine-grained than typical phonetic annotation.
The articulatory reconstruction model is not very predictive of human behaviour. The likely reasons are simple. First, predicting articulatory parameters for novel speakers is difficult, and this model is far from state-of-the-art performance. Second, the model does not predict a complete set of articulators. It is thus unsurprising that, when scored on the experimental task, this model is worse than even the acoustic features.
The continuous speech recognizer (DeepSpeech) is also bad at predicting human behaviour, but, unlike the articulatory reconstruction, performs well on the experimental task. This model is different from the English bottleneck models in two ways. First, it is in principle capable of taking into account longer temporal dependencies than the finite 310 ms window used by the bottleneck model. Second, it is optimized not to predict phones, but orthographic (letter) transcriptions. These are quite similar, but English orthography is still not completely transparent, which might help explain the model’s behaviour: distinct sequences of phones correspond to distinct sequences of letters (thus, allow for a high score on the experimental task), but the representation’s distances may capture similarities and differences exclusively found in spelling. We also note, however, that the model shows the largest discrepancy between the English and French stimuli (larger than the English listeners’). It is not immediately obvious how this could be attributed to predicting English letters versus phones.
This difference is clear from Figure 1 (left), which plots DeepSpeech’s discriminability scores against listeners’ averaged accuracy for each contrast, colour-coded for whether the items are English or French. We observe a clear separation in the distributions of DeepSpeech’s discriminability of English contrasts (concentrated on the right-hand part of the graph, where the model is better) versus French contrasts. This separation is not visually salient in humans, nor in the DPGMM model (right). DeepSpeech seems to be over-trained on the task of discriminating English phones.
Finally, we consider the nature of the benchmark itself. While speaker variability was introduced in order to prevent listeners from attending to acoustic details, the delay between stimuli is still relatively short, meaning that listeners need not rely heavily on memory, and will thus still have reasonable access to detail. The stimuli are also short, and often do not correspond to full syllables, so that listeners may not treat them as fully speech-like. The fact that the A and B stimuli are from the same speaker may also attune listeners to small differences between those two stimuli, potentially thus attuning them to low-level differences overall. If listeners focus on detail, then the fact that the drop in human performance on non-native stimuli is small is unsurprising. The fact that the multilingual and DPGMM models are good at predicting the behaviour of English-speaking listeners may prove to be a consequence of this particular mode of listening. Other benchmark tasks are needed to obtain a fuller picture.
Stimuli extracted from running speech may be more ecological for evaluating typical speech recognition models, but they are difficult to interpret. While context was, in principle, held constant across each stimulus triplet, in reality, it is very difficult to get phonetically well-matched contexts in natural speech. Although the stimuli were selected by hand to minimize the differences due to surrounding context, they are not perfectly controlled, which means that the target (centre) phone is not the only thing driving human listeners’ decisions. Among the more difficult English contrasts for listeners here are English [f]–[v], which should not be particularly difficult, and French [f]–[y], which should be trivially easy. Items like these evidently do not highlight the desired contrast—and the fact that the locus of contrast was not always apparent might also have led listeners to attend to acoustic detail.
7 Related work
Our data set is in the spirit of other cognitive benchmarks for artificial intelligence (syntax: blimp; intuitive physics: intphys; question answering: natquest). In speech perception, the idea of matching human behaviour is not new Kleinschmidt and Jaeger (2015); Feldman and Griffiths (2007); Schatz et al. (To appear, 2017); Schatz and Feldman (2018), and is an echo of the literature on modelling phonetic learning, most notably guenthergjaja, who qualitatively compared their modelled distances to similarities reported in the literature for human listeners. To our knowledge, the only previous work providing stimuli, human responses, and recommendations for generating predictions at the individual stimulus level with a wide range of models is millet2019comparing. Those stimuli only tested a narrow range of cross-linguistic phone contrasts, however, and were non-words read in a word-list style, rather than extracts of natural, running speech.
The PEB stimuli are drawn from the evaluation for the Zero Resource Speech Challenge 2017 Dunbar et al. (2017), widely used in evaluating unsupervised speech models. The PEB complements this existing measure (not scored against humans), and can be applied to any model tested on it.
8 Summary of Contributions
We have presented the Perceptimatic English Benchmark, an open English-language benchmark for computational models of human speech perception made up of stimuli that are ecological for typical speech models. It is the only open data set we know of that systematically probes a wide range of phone contrasts and is easy to compare with computational models. We have shown, for the first time, that a standard speech recognizer is not predictive of human phone classification behaviour, while models not optimized to recognize English phonemes are (a quasi-universal phone classifier and a model of short-duration acoustic events). The multilingual model is easy to use off-the-shelf, and we recommend it to researchers needing an estimate of perceptual distance. [Footnote 10: https://coml.lscp.ens.fr/docs/shennong/]
References
- Cieri et al. (2004). The Fisher corpus: a resource for the next generations of speech-to-text. In LREC.
- Dunbar et al. (2017). The Zero Resource Speech Challenge 2017. In 2017 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 323–330.
- Dupoux (2018). Cognitive science in the era of artificial intelligence: a roadmap for reverse-engineering the infant language-learner. Cognition 173, pp. 43–59.
- Elman and McClelland (2015). Exploiting the lawful variability in the speech wave. J. Perkell and D. Klatt (Eds.), Vol. 335, pp. 71–90.
- Feldman and Griffiths (2007). A rational account of the perceptual magnet effect. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 29.
- Godfrey et al. (1992). SWITCHBOARD: telephone speech corpus for research and development. In [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 517–520.
- Hannun et al. (2014). Deep speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
- Kleinschmidt and Jaeger (2015). Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychological Review 122 (2), pp. 148–203.
- Mahrt (2016). LMEDS: Language markup and experimental design software.
- McClelland and Elman (1986). Interactive processes in speech perception: The TRACE model. Cognitive Psychology 18, pp. 1–86.
- Norris and McQueen (2008). Shortlist B: a Bayesian model of continuous speech recognition. Psychological Review 115 (2), pp. 357–395.
- Panayotov et al. (2015). Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210.
- Parrot et al. (2019). Independent and automatic evaluation of acoustic-to-articulatory inversion models. arXiv, pp. arXiv–1911.
- Povey et al. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU).
- Scharenborg et al. (2005). How should a speech recognizer work? Cognitive Science 29, pp. 867–918.
- Schatz et al. (2017). ASR systems as models of phonetic category perception in adults. In Proceedings of the 39th Annual CogSci Meeting.
- Schatz et al. (to appear). Early phonetic learning without phonetic categories: Insights from machine learning. Proceedings of the National Academy of Sciences.
- Schatz and Feldman (2018). Neural network vs. HMM speech recognition systems as models of human cross-linguistic phonetic perception. In Proceedings of the Conference on Cognitive Computational Neuroscience, pp. 1–4.
- Tiede et al. (2017). Quantifying kinematic aspects of reduction in a contrasting rate production task. The Journal of the Acoustical Society of America 141 (5), pp. 3580–3580.