High-quality articulatory speech synthesis provides compelling possibilities for studying articulatory phonetics and constructing low-resource speech technologies. However, many of these scenarios require a mapping from a linguistic specification, e.g. a sequence of phonetic symbols, to articulatory gestures. To date, such mappings have been developed manually by developers or users of the particular synthesizer. VocalTractLab [birkholz2005thesis, birkholz2013coart], a state-of-the-art synthesizer that simulates a vocal tract based on magnetic resonance imaging (MRI) data, is a system capable of producing natural-sounding speech [krug11intelligibility]. However, the increase in realism comes with an increase in the time and expertise required to develop gestures, and the increase in speech quality reflects subtle articulatory choices, which makes it difficult to develop language- or dialect-independent gestures manually. (Furthermore, gestural mappings will need to be revisited each time a new speaker model is introduced.) The result is that the set of available gestures will be, at best, incomplete or, in the worst case, only appropriate for a specific language.
This problem can be resolved by a procedure that automatically learns the articulatory gestures needed to produce linguistically relevant utterances. To be practical, the process should function without detailed articulatory phonetic information such as aligned MRI data or expert knowledge. This resembles the task of spoken language acquisition in general, and early vocal learning in particular [jusczyk1997spokenlang]. In this paper we implement one of the central processes for autonomous learning of speech production, namely articulatory exploration or “babbling”, as an auditory optimisation task [oller1988babbling].
Using this simulation, we show that it is possible to discover the articulation of syllables with complex onsets by using a perceptual encoder trained on a general speech recognition corpus to provide auditory objectives. Furthermore we demonstrate the relative success of different exploration strategies and examine the nature of articulatory solutions in terms of coarticulation between consonant and vowel gestures.
This is a significant development towards an autonomous process for constructing linguistic-to-articulatory mappings for any language or dialect and provides a framework for investigating theoretical questions in articulatory phonetics and speech perception in vocal learning.
As in previous work [vniekerk2020cvopt] we formulate articulatory exploration as an optimisation task using VocalTractLab (VTL) to produce candidate utterances. However, in this work, novel mechanisms including well-motivated somatosensory specifications and a language-oriented auditory perceptual mapping form part of the objective function. The process of discovering linguistically relevant gestural mappings is illustrated in Figure 1 and is briefly motivated in the following subsections.
2.1 Articulatory exploration
Babbling during early vocal learning has often been simulated as a goal-directed or imitative process, usually involving a set of auditory objectives [bailly1997sensmot, rasanen2012phonla, rasilo2017learnvowel, pagliarini2021review, philippsen2021speechacq]. This type of exploration is also considered central to finding appropriate inverse models during sensorimotor learning in general [jordan1992distal, rolf2010goalbabble].
We implement goal-directed articulatory exploration as the process of minimising auditory and somatosensory losses to discover a linguistically relevant utterance. The central block in Figure 1 is the global optimisation task

x̂ = argmin_x L(y, P(S(x)))    (1)

of finding the articulatory gestures x that minimise the loss function L, with y the combined auditory/articulatory goal, P the auditory perceptual mapping described in Section 2.2, and S the speaker vocal tract model. In this paper, we use the Tree-structured Parzen Estimator approach [bergstra2011hpopt] as the algorithm to drive the articulatory sampling.
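The structure of this exploration loop can be sketched as follows. This is a minimal stand-in using uniform random sampling (the initial phase of the search) over a toy quadratic loss; in the actual system the sampler is the TPE algorithm and the loss involves VTL synthesis plus the perceptual mapping. The parameter names and goal values are illustrative placeholders, not values from the paper.

```python
import random

# Hypothetical articulatory search space: one uniform prior per
# (normalised) upper vocal tract parameter of the speaker model.
PARAMS = ["TCX", "TCY", "LD"]                 # illustrative names
GOAL = {"TCX": 0.3, "TCY": 0.7, "LD": 0.2}    # arbitrary "goal" target

def loss(candidate):
    # Placeholder objective: squared distance to the goal target.
    # The real loss compares the auditory percept and somatosensory
    # vector of the synthesised utterance against the optimisation goal.
    return sum((candidate[p] - GOAL[p]) ** 2 for p in PARAMS)

def explore(n_iters=5000, seed=0):
    rng = random.Random(seed)
    best, best_loss = None, float("inf")
    for _ in range(n_iters):
        candidate = {p: rng.uniform(0.0, 1.0) for p in PARAMS}
        cand_loss = loss(candidate)
        if cand_loss < best_loss:
            best, best_loss = candidate, cand_loss
    return best, best_loss

best_target, best_loss = explore()
```

A model-based sampler such as TPE replaces the uniform draw with proposals informed by previously evaluated candidates, which is what makes 5000 iterations per trial practical.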
2.2 Auditory perceptual objectives
Most vocal learning simulations assume that auditory objectives are derived from the speech signal alone, i.e. acoustic imitation. This approach suffers from two issues: the speaker normalisation and correspondence problems. The speaker normalisation problem refers to the difficulty of finding linguistically correct utterances when comparing speech produced by different speakers; e.g. it is well known that formant frequencies vary systematically with speakers’ vocal tract length and that this may affect speech recognition performance [waibel1997vtln]. The correspondence problem is one of associating articulatory gestures obtained for an acoustic reference to linguistic contexts [nehaniv2002corrprob, brass2005imit, philippsen2021speechacq]. We have argued elsewhere, based on the well-known finding that language-oriented speech perception precedes the onset of canonical babbling in infants [kuhl2004evoc], that these problems can be addressed by a language-oriented auditory perceptual mapping derived from linguistically grounded multi-speaker speech stimuli [vniekerk2022cvopt].
2.3 Articulatory objectives
Articulatory objectives represent explicit objectives that originate from non-auditory signals. In humans it is known that sighted individuals may benefit from visual information [murakami2015seeingu] and speakers may track the implementation of these objectives through somatosensory feedback [nasir2006somatosensory]. However, since we rely on a general optimisation algorithm with uniform priors instead of a physiologically motivated approach (as for example in [serkhane2007fcsim, nam2013apsim]) and the speaker model does not incorporate inertial and other relevant process measurements, articulatory objectives may also serve as an implicit mechanism that regularises the solution space, i.e. resulting in more prototypical articulatory gestures.
In this paper we employ one set of somatosensory objectives that could be derived from visual information: plosive consonants at the start of the syllable should form an oral closure and the vowel is associated with an open vocal tract. We also experiment with a regularisation objective to induce intra-syllable coarticulation.
2.4 Speech production
To produce articulatory trajectories, we use the target-approximation model (TAM) [xu2001tam] which has been adopted in VTL to realise utterances represented by articulatory targets [birkholz2007gestures]. The resulting parameterisation of the articulatory dynamics combined with simplifying assumptions of synchronisation [birkholz2011qtacv] has enabled the reliable discovery of simple CV syllables using derivative-free optimisation or even random sampling [xu2019icphs, vniekerk2020cvopt]. While previous works implemented coarticulation [xu2020sylsync, liu2021sylsync] by explicitly parameter tying [xu2019icphs, vniekerk2020cvopt], this work tests the hypothesis by including coarticulation as an articulatory objective, which further reduces the explicit knowledge required in the process, as described in the next section.
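To convey the qualitative behaviour of target approximation, the following is a deliberately simplified first-order sketch: each articulatory parameter decays exponentially from its current value toward the active target with a time constant tau. (VTL's TAM uses a higher-order critically damped system, so this is an illustration of the idea, not the actual model.)

```python
# Simplified first-order target approximation: Euler integration of
# dx/dt = (target - x) / tau. All values are illustrative.
def approach(x0, target, tau, dt, n_steps):
    traj, x = [x0], x0
    for _ in range(n_steps):
        x += (target - x) * (dt / tau)  # move a fraction dt/tau closer
        traj.append(x)
    return traj

# Approach a target of 1.0 from 0.0 with a 50 ms time constant,
# sampled every 5 ms for 500 ms.
traj = approach(x0=0.0, target=1.0, tau=0.05, dt=0.005, n_steps=100)
```

A smaller time constant yields a faster, more abrupt transition toward the target; the two free time constants mentioned above play this role for the glottal and upper vocal tract parameter groups.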
3 Experimental setup
The first step towards implementing the articulatory exploration process (Figure 1) to find CCV syllables is to construct the auditory perceptual mapping that produces syllable embeddings or percepts. For this purpose, we used the LibriSpeech speech recognition corpus [panayotov2015librispeech]
representing linguistically grounded speech. Vowel (V), consonant-vowel (CV) and CCV syllable onsets were extracted from the clean training set and used to train a recurrent neural network that encodes the audio to a low-dimensional space related to the linguistic context. The Mel-spectrogram was used as input and the output vector was a concatenation of one-hot encoded phonetic identities defined in the ARPABET phoneset (used in the CMU pronunciation dictionary [cmu2000cmudict]), which is appropriate for the American English speech data. The resulting vector representing a V, CV or CCV syllable onset is 64-dimensional: two sub-vectors each encoding 24 consonant classes (including absence) and one encoding 16 vowels. Evaluating this encoder on the LibriSpeech test set by converting the output to a categorical form results in a recognition rate of 73%.
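The conversion of the 64-dimensional encoder output to a categorical form can be sketched as below. The sub-vector layout follows the text (24 + 24 + 16); the class label lists are hypothetical stand-ins for the ARPABET inventory.

```python
# Hypothetical label inventories; the real system uses ARPABET symbols.
C_CLASSES = ["<none>"] + [f"C{i}" for i in range(1, 24)]  # 24 entries
V_CLASSES = [f"V{i}" for i in range(16)]                  # 16 entries

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

def decode(vec):
    """Map a 64-dim percept to (first consonant, second consonant, vowel)."""
    assert len(vec) == 64
    c1 = C_CLASSES[argmax(vec[0:24])]    # first consonant sub-vector
    c2 = C_CLASSES[argmax(vec[24:48])]   # second consonant sub-vector
    v = V_CLASSES[argmax(vec[48:64])]    # vowel sub-vector
    return c1, c2, v
```

The "<none>" class is what allows the same encoder to represent V, CV and CCV onsets uniformly.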
For speech synthesis, VTL (version 2.3, available at https://www.vocaltractlab.de) was used to realise articulatory targets with the “JD2” male speaker and the geometric glottis model [birkholz2019geomglot]. Since we focused on investigating the upper vocal tract parameters, the glottal parameters were kept constant at the appropriate preset values for the particular segment (e.g. “modal voice” for the vowel), with the exception of the chink area and relative amplitude, which were free to be optimised to allow control of the voice onset time. All of the upper vocal tract parameters (Table 1) were free for optimisation, except the velum opening (VO), which was kept closed, and the tongue root parameters (TRX, TRY), which were derived from the tongue body values [krug2022efficient]. Timing in the target-approximation trajectories was controlled by two free parameters, one time constant each for the glottal and upper vocal tract parameters.
The somatosensory objectives relied on the same VTL configuration to provide proprioceptive or tactile feedback by means of the tube areas function. Two simple objectives were defined and represented with values in the range [0, 1]. (1) The vocal tract closure objective value is 0 when a vocal tract closure is required, e.g. for plosive consonants, and 1 when a minimum opening is required, e.g. for vowels. (2) The lip closure objective value is 0 when a closure is formed by the lips and 1 when the lips are open. This is motivated by visual information and was only applied to the first consonant (C1), depending on whether it is a bilabial or other type of plosive. No somatosensory objectives were applied to the intermediate consonant (C2).
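A minimal sketch of how such objective values could be derived from a tube-area function is given below. The closure threshold is an assumption for illustration; in the real system the areas come from VTL's tube geometry.

```python
# Assumed threshold below which a tube section counts as a closure.
CLOSURE_THRESHOLD_CM2 = 0.01

def vt_closure_value(tube_areas_cm2):
    """0 if the vocal tract is closed anywhere along its length,
    1 if a minimum opening exists everywhere (e.g. for a vowel)."""
    return 0.0 if min(tube_areas_cm2) < CLOSURE_THRESHOLD_CM2 else 1.0

def lip_closure_value(lip_area_cm2):
    """0 if the lips form a closure, 1 if the lips are open."""
    return 0.0 if lip_area_cm2 < CLOSURE_THRESHOLD_CM2 else 1.0
```

For a bilabial plosive the required pair of values would be (0, 0), while an alveolar plosive would require (0, 1): closed somewhere, but not at the lips.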
A single regularisation objective was implemented by quantifying the coarticulation between any two articulatory targets x_a and x_b as the normalised distance between the range-normalised upper vocal tract vectors:

d(x_a, x_b) = ‖x̃_a − x̃_b‖ / √N    (2)

with N the number of upper vocal tract dimensions (dividing by √N keeps the distance in [0, 1]). In Section 4 we specifically compare systems with and without this coarticulation objective between each consonant and the vowel. Finally, all the relevant articulatory objectives were concatenated into a single vector and combined with the auditory percept to form the optimisation goal y.
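The normalised coarticulation distance described above can be computed directly, assuming the vectors have already been range-normalised to [0, 1]:

```python
import math

def coart_distance(x_a, x_b):
    """Euclidean distance between range-normalised target vectors,
    scaled by sqrt(N) so the result lies in [0, 1]."""
    n = len(x_a)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_a, x_b))) / math.sqrt(n)
```

A value of 0 means two targets are articulatorily identical in the upper vocal tract, so minimising this distance between a consonant and the vowel encourages overlapping (coarticulated) gestures.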
To implement the optimisation algorithm [bergstra2011hpopt] we used the hyperopt software package (v0.2.5, https://github.com/hyperopt/hyperopt). The articulatory space was defined by the speaker model and initially sampled uniformly. The loss function was defined as the weighted sum of the Euclidean distances calculated for the individual sub-vectors, with the auditory components having a weight ratio of 2:1 to the articulatory component. For each distinct syllable an ideal objective vector was constructed based on its phonetic constituents. The optimisation algorithm samples articulatory targets, which are synthesised by VTL, and each sample is evaluated by the auditory perceptual mapping and the VTL tube areas function to determine the resulting vector and associated loss. To improve the computational efficiency of the process, the synthesis of speech and auditory evaluation are only performed when the somatosensory objectives are satisfied; when they are not, the loss function is set to an arbitrarily large value proportional to the loss associated with the optimisation goal.
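The weighting and the early-rejection shortcut can be sketched as follows. The sub-vector contents and the penalty constant are illustrative assumptions, not values from the paper.

```python
import math

W_AUDITORY, W_ARTICULATORY = 2.0, 1.0  # 2:1 weight ratio from the text
PENALTY = 100.0                        # assumed arbitrary large value

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def total_loss(auditory, articulatory, goal_aud, goal_art,
               somatosensory_ok):
    """Weighted sum of sub-vector distances, with early rejection."""
    if not somatosensory_ok:
        # Somatosensory objectives failed: skip synthesis and auditory
        # evaluation entirely and return a large penalty so the sampler
        # avoids this region of the articulatory space.
        return PENALTY
    return (W_AUDITORY * euclid(auditory, goal_aud)
            + W_ARTICULATORY * euclid(articulatory, goal_art))
```

The early rejection matters because the expensive steps (VTL synthesis and the neural encoder) are only ever run on candidates that already satisfy the cheap tube-area checks.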
The proposed framework was evaluated through an experiment designed to investigate the following aspects:
Exploration strategies for complex syllable onsets: Is it necessary to optimise certain segment targets jointly or can this be done independently and in what sequence?
Coarticulation: Can we make use of the regularisation objective defined above to reproduce natural observations associated with intra-syllable coarticulation [liu2021sylsync]?
Sufficiency: What is the relative success rate of the process for the range of CCV syllable types occurring in American English and what are the implications for future work?
Since the process is non-deterministic, dependent on random initial exploration, we estimated the success rate using independent repeated trials. To allow for comparison of different syllable types (aspect 3) we set up 5 trials for each of the 150 valid combinations (determined by the existence of entries in the CMU dictionary and the LibriSpeech corpus) of the following sets of segments: /b,d,f,g,k,p,s,S,t,T/, /k,l,p,r,t,w/, and /A,æ,2,E,O,I,i:,U,u:/; a total of 750 independent trials for each experimental setup. Each trial was allocated 5000 iterations that led to a synthesised utterance, i.e. excluding articulatory targets that did not satisfy the basic somatosensory objectives. Four different exploration strategies were investigated (aspect 1):
Single-pass, joint: Find all segment targets jointly and select the best sample after 5000 evaluations.
Two-pass, vowel then onset: Find the vowel targets by producing vowel-only utterances in a first pass, then find the onset consonants jointly by producing CCV utterances using the best vowel targets from the first pass. (To allow for varying coordination requirements in different contexts, the glottal parameters and time constants are never fixed but re-optimised during each pass.) The number of iterations is allocated to the two passes in the ratio 1:4.
Three-pass, C1 first: Find the best targets for each segment by producing V, C1V and C1C2V utterances in respective passes (iterations are allocated in the ratio 1:2:2).
Three-pass, C2 first: As in the previous configuration, but with the intermediate consonant (C2) explored first.
Each of the above strategies was implemented with and without the coarticulation objective (aspect 2) and the best outcome from each trial was evaluated by the syllable encoder. This was done by mapping perceptual representations to symbols using the argmax operation on each sub-vector and calculating the identification rate.
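The identification rate is then a simple exact-match rate over the decoded symbol tuples, as sketched below (the symbol tuples here are hypothetical examples):

```python
def identification_rate(predicted, intended):
    """Fraction of trials whose decoded (C1, C2, V) tuple exactly
    matches the intended syllable."""
    hits = sum(1 for p, t in zip(predicted, intended) if p == t)
    return hits / len(predicted)

# Illustrative usage: one of two trials matches its intended syllable.
rate = identification_rate(
    [("b", "l", "a"), ("d", "r", "i")],
    [("b", "l", "a"), ("t", "r", "i")],
)
```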
The overall results are summarised in Table 2 in terms of identification rates, from which we note the following: (1) There is no significant difference in the auditory success rate when comparing conditions with and without the coarticulation objective, meaning that adding the additional articulatory objective has no negative effect on the goal of optimising for auditory perception. (2) In both cases the three-pass strategies, which explore one consonant last on the basis of a pre-optimised utterance, lead to significantly worse results (underlined). (3) When applying the coarticulation objective, jointly optimising the vowel and consonants results in significantly worse outcomes for the vowel (underlined) and optimising the consonants jointly after the vowel results in a significantly better outcome for full syllables (in bold).
By repeating this analysis over onsets and vowels we find that the outcomes with and without coarticulation, in terms of auditory identification rate, are similar in all contexts. In general, all contexts have identification rates in excess of 80%, except /dr, dw, tr, tw/ for onsets and /U, i:/ for vowels, indicating that these cases are particularly difficult to discover.
To confirm that the coarticulation objective has the intended effect, articulatory distances are visualised in Figure 2. We see that: (1) There are significant reductions in the distance between consonant and vowel targets for some articulatory parameters in each case. (2) Some expected patterns of articulatory overlap emerge, e.g. the bilabial targets have lower distances to the vowel in many dimensions, with the exception of the lip distance (LD).
Since the optimisation process is partially dependent on the same auditory perceptual mapping that produces the identification rates we are interested in (Table 2), the results should be interpreted carefully. The absolute identification rates are not directly comparable with the recognition rates obtained on natural speech. Instead, we make use of the results in two ways: (1) to compare the relative success of different exploration strategies in terms of the auditory objectives, and (2) to indicate problematic contexts that require further work. The difficult cases listed in Section 4 could be addressed in future by additional somatosensory objectives or more ecologically plausible articulatory sampling in the case of the onsets, and by better modelling of duration in the case of the vowels.
The current optimisation-based simulation of babbling is a first step in learning articulation without manual intervention and will be supplemented by a gradient-based learning process towards fluent articulation in future. We invite the interested reader to listen to the samples available at https://github.com/danielshaps/evoclearn_optccv_2022 to get a qualitative sense for the extent to which these correspond to actual babbling utterances.
We have presented a flexible simulation of babbling that consists of auditory perceptual, somatosensory and regularisation objectives and demonstrated that it can discover the articulation of syllables with complex onsets. This framework was used to compare different exploration strategies leading to an effective process where the vowel targets are found independently in a first pass and the consonants in the onset are jointly optimised using the vowel as an “anchor”. With this two-pass procedure it is possible to apply the coarticulation objective (Eq. 2) without negatively affecting the outcomes in terms of the auditory perceptual goals. This means that the framework and analysis presented in Figure 2 can be used to discover the relative (in)dependence of articulatory dimensions in different contexts automatically – parameter tying was done manually in previous work [xu2019icphs, vniekerk2020cvopt]. Furthermore, this form of regularisation could prove useful if the utterances generated here are used as a basis for learning forward and inverse models of articulation [jordan1992distal].
This work has been funded by the Leverhulme Trust Research Project Grant RPG-2019-241: “High quality simulation of early vocal learning”.