In recent years, self-supervised learning has attracted considerable attention. Its typical pipeline is as follows: generate pseudo labels from the data itself, perform supervised learning to learn representations, and then transfer the learned representations to downstream tasks. Such methods have achieved great success in improving downstream task performance and reducing the demand for labelled data in many fields, such as computer vision (Chen et al., 2020; Kolesnikov et al., 2019; Xie et al., 2016), natural language processing (Devlin et al., 2019), and speech processing (Jiang et al., 2019). However, despite the various constraints proposed, the dimensions of the learned representations normally lack explicit physical meaning. How to give a representation produced by self-supervised learning an explicit physical meaning, so that it becomes explainable, is therefore an important but unsolved problem.
Given observed data, there must be an underlying physical process that generates it from certain inputs. If we have access to such a forward process, either physical or physically simulated, we may design a learning procedure that infers the latent input parameters as a physically meaningful representation of the observed data. Normally, to solve such an inverse problem, the physical process can act as the decoder, in auto-encoder parlance, generating an output for supervision from the inferred parameters. However, there are several challenges:
Non-differentiable physical process. The physical process may be non-differentiable w.r.t. its input, so auto-encoder-like methods cannot be applied directly.
Non-trivial to sample data. Given a physical process, it is natural to consider constructing a synthetic paired training dataset by first sampling inputs and then generating outputs. However, an explicit prior over the input parameters is normally difficult to obtain, while random sampling can be computationally infeasible and may yield meaningless samples.
Generalization issues. Even when priors or samples of the input parameters are available, a trained model may generalize poorly to unseen observations.
To tackle these challenges, we propose a novel analysis-by-synthesis procedure that iterates between sampling and training. At the sampling step, given observed data, a neural network is used to approximate the intractable posterior of the input parameters; parameters are then sampled from this posterior and passed through the physical process to generate outputs in the observational space. At the training step, the same network is trained on the sampled paired data to predict input parameters from observations. These two steps operate iteratively and boost each other, similar to the guess-try-feedback process in human learning. The entire learning procedure integrates the physical generation process to obtain meaningful parameter estimates in a self-supervised manner. The proposed method addresses the three challenges above as follows:
The physical process is only used for generating outputs, without passing gradients through.
No priors on the input parameters are needed: the model learns from unlabelled data to approximate the posterior of the input parameters for efficient sampling.
The self-supervised nature of the training procedure means a trained model can adapt itself to unseen observations.
We verify the proposed method by tackling the acoustic-to-articulatory inversion problem, that is, extracting articulatory kinematics information from speech. We adopt the Tube Resonance Model (TRM) (Hill et al., 2017) to simulate the human vocalization mechanism. Given unlabelled reference utterances and starting completely from scratch with random initialization, a network learns to control the TRM so as to synthesize sound similar to the reference utterances by alternately sampling data and training itself. Experiments show that the proposed algorithm converges steadily. The synthesized sounds achieve a signal-to-noise ratio of around 16 dB on both single-speaker and multi-speaker datasets. Further experiments show that trained models generalize well to unseen speakers or even a new language, and that performance can be further improved through self-adaptation.
2 Related Work
Inverse problems aim to infer causes from observations. Traditionally, analytical methods have been used to solve inverse problems such as image super-resolution (SR) (Park et al., 2003) and computed tomography (CT) (Natterer, 2001). In recent years, data-driven methods, especially deep learning approaches, have achieved state-of-the-art results (Brooks et al., 2019; Liu et al., 2019). For problems with a differentiable or invertible forward process, methods have been proposed to solve tasks such as SR (Sønderby et al., 2016). Otherwise, the key to solving an inverse problem in a data-driven way is to obtain supervision data. When a physical forward process is available, paired training data can be generated synthetically by applying the forward process to samples of the input parameters (e.g., adding noise to clean images to train a denoiser). However, appropriate priors or samples of the input parameters are not always available, and the parameter space can be high dimensional, which makes sampling the input parameters non-trivial.
To solve an inverse problem for given observed data, sampling from the posterior distribution can be efficient. Markov chain Monte Carlo (MCMC) is a widely used sampling method, but it requires an explicit prior and can be computationally infeasible in a high-dimensional space. As for deep sampling methods, Adler and Öktem (2018) proposed sampling directly from the posterior with a conditional generative adversarial network (CGAN), but this method requires supervised training. Variational auto-encoders (VAEs) (Kingma and Welling, 2014) directly approximate the posterior of the latent variables with a neural network. Similarly, we use a neural network to approximate the posterior of the parameters and sample from it to generate paired training data. However, since the forward process (corresponding to the decoder in a VAE) can be non-differentiable, we minimize the reconstruction error of the latent input parameters rather than that of the observed data.
Acoustic-to-Articulatory Inversion (AAI)
The motor theory (Liberman and Mattingly, 1985) suggests that when perceiving speech, humans perceive articulatory information in addition to acoustic features. There has been a lot of work on inferring articulatory kinematics information from speech. Afshan and Ghosh (2015), Mitra et al. (2017) and Chartier et al. (2018) train neural networks for acoustic-to-articulatory inversion in a supervised manner on simultaneously recorded speech acoustics and articulatory data. Since articulatory data is often recorded by ElectroMagnetic Articulography (EMA) or Magnetic Resonance Imaging (MRI), large-scale articulatory recordings can be too expensive to obtain. Besides, trained models are always speaker-dependent.
Given an articulatory synthesizer, either mechanical (Fukui et al., 2009; Yoshikawa et al., 2003) or physically simulated (Howard and Messum, 2014), articulatory synthesis takes a time series of articulatory parameters and produces a speech signal. Several methods (Higashimoto and Sawada, 2002; Asada, 2016) have been proposed to infer articulatory parameters that reproduce a given speech signal by first building a dataset of parameter-sound pairs and then training a model on it. These approaches deal with either a single vowel or a single word of two to three syllables, which is what makes building such datasets feasible. Gao et al. (2019) developed a generic method for copy-synthesis of speech with good quality, but the method is time-consuming at inference. In this work, we adopt a simulated vocalization model as the synthesizer to reproduce speech of arbitrary length.
3.1 Problem Scenario
Traditionally, an inverse problem is formulated as solving an equation of the form $y = f(x) + e$, where $y$ is the measured data (observation); $x$ is the input parameter (latent variable) of the forward operator $f$; both $x$ and $y$ can be high-dimensional vectors; $f$ describes how measured data is generated from the input parameters in the absence of noise; and $e$ is the observational noise. For clarity, we include the noise model in the forward operator in the following description. When $f$ is non-linear and not differentiable w.r.t. its parameters, it is difficult to apply analytical methods. We therefore consider solving the inverse problem through learning.
Given supervised training data , an inverse problem can be formulated to find an operator, where solves:
In this work, we aim to solve an unsupervised inverse problem, i.e., given observed data and a forward operator , to find an operator , where solves:
In which is the reconstruction error, while is the potential prior constraints on the parameters. If observational noise is additive Gaussian, . When the reconstruction error is small enough, it is fair to say is a solution to the inversion of , regardless of prior constraints.
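The two objectives can be sketched in symbols as follows. The notation is an assumption introduced here for illustration only: $y$ is an observation, $x$ the latent input, $f$ the forward operator, $p_\theta(x \mid y)$ the learned posterior model with point estimate $\hat{x}_\theta(y)$, and $\mathcal{L}_y$, $\mathcal{L}_x$ the reconstruction error and the prior term. One plausible form of the supervised objective (1) and the unsupervised objective (2) is:

```latex
\begin{align}
\theta^{*} &= \arg\max_{\theta} \sum_{i} \log p_{\theta}(x_i \mid y_i) \tag{1} \\
\theta^{*} &= \arg\min_{\theta} \sum_{i} \Big[ \mathcal{L}_y\big(f(\hat{x}_\theta(y_i)),\, y_i\big)
              + \mathcal{L}_x\big(\hat{x}_\theta(y_i)\big) \Big] \tag{2}
\end{align}
```

Under additive Gaussian observational noise, the reconstruction error would reduce, up to a constant scale, to a squared error, $\mathcal{L}_y\big(f(\hat{x}), y\big) = \lVert f(\hat{x}) - y \rVert_2^2$.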
Given a forward operator, we can in principle sample from the parameter space and then apply the forward operator to obtain paired data for training. However, in many problems it is non-trivial to obtain appropriate priors, and the latent space can be high dimensional, so random sampling can be computationally infeasible.
Given an observation, it can be efficient to sample around the corresponding latent variable (which we do not know), that is, from a Bayesian perspective, to sample from the posterior of the latent variable given the observation. The problem is that an explicit expression for the posterior is normally intractable. However, what Equation (1) learns under supervision is precisely the posterior distribution of the latent variable given the observation. Intuitively, if we can approximate the posterior well for the observed data, we can sample from it to generate training data, which in turn helps optimize the approximation of the posterior in a supervised way.
Following this intuition, we propose to solve the unsupervised inverse problem with an iterative sampling and training procedure, as shown in Algorithm 1. At the sampling step, a neural network is used to approximate the posterior of the latent variables; we then sample from this posterior and apply the forward operator to generate paired supervision data. At the training step, the same network is trained on the sampled supervision data to improve its approximation of the posterior. After several iterations, the trained network provides a good approximation of the posterior, which in turn means that the reconstruction error in Equation (2) of the predicted parameters can be small enough to consider the unsupervised inverse problem solved. We name the algorithm EMbodied Self-Supervised Learning (EMSSL), since it integrates a physical forward process into the learning procedure. When the learning procedure ends, we can run the inference network deterministically to obtain inversion results.
In practice, if the latent variable is continuous, it is common to describe the posterior with a multivariate Gaussian whose spread is controlled by a hyperparameter. In our experiments, we found that given enough observed data, the number of samples per datapoint can be set to 1, so that the deterministic output of the inference model can be adopted directly as a sample. Otherwise, sampling several latent variables and weighting the samples as in (Burda et al., 2016) may help.
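The alternation between sampling and training can be illustrated on a toy problem. The sketch below is not the paper's implementation: the forward process (a linear map followed by rounding), the linear inference model fitted by least squares, and the names `forward`, `infer`, `emssl` and `sigma` are all assumptions chosen so that the example runs in a few lines. The forward process is used only as a black box that generates outputs, and no gradient is passed through it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical forward process: a linear map followed by rounding, so it is
# piecewise constant and offers no useful gradient w.r.t. its input.
W = np.array([[1.0, 0.5], [-0.3, 1.2]])

def forward(x):
    return np.round((x @ W.T) * 16.0) / 16.0

def with_bias(y):
    return np.hstack([y, np.ones((len(y), 1))])

def infer(theta, y):
    """Deterministic output of the (linear) inference model: the posterior mean."""
    return with_bias(y) @ theta

def reconstruction_error(theta, y_obs):
    return float(np.mean((forward(infer(theta, y_obs)) - y_obs) ** 2))

def emssl(y_obs, n_iters=5, sigma=0.3):
    theta = rng.normal(scale=0.1, size=(y_obs.shape[1] + 1, 2))  # random init
    errors = [reconstruction_error(theta, y_obs)]
    for _ in range(n_iters):
        # Sampling step: draw parameters from a Gaussian centred on the model
        # output, then push them through the forward process.
        x_s = infer(theta, y_obs) + sigma * rng.normal(size=(len(y_obs), 2))
        y_s = forward(x_s)
        # Training step: refit the same model on the sampled pairs (y_s -> x_s).
        theta, *_ = np.linalg.lstsq(with_bias(y_s), x_s, rcond=None)
        errors.append(reconstruction_error(theta, y_obs))
    return theta, errors

# Unlabelled observations produced by hidden parameters the learner never sees.
y_obs = forward(rng.normal(size=(500, 2)))
theta, errors = emssl(y_obs)
print(errors[0], errors[-1])  # reconstruction error before and after learning
```

In this toy setting a single least-squares fit already recovers a good inverse; with a neural network trained for a few epochs per iteration, several iterations would be needed before the sampled data concentrate around the true latent variables.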
There are many options for updating the training set. The simplest is to replace the training set with the latest sampled dataset. To fully utilize historically sampled data and stabilize the learning procedure, we can instead drop a fixed percentage of the historical data in each iteration and then merge the remaining historical data with the latest sampled set, as shown in Algorithm 2.
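The update rule with a drop rate can be sketched as below. The function name `update_training_set` and the choice to drop historical samples uniformly at random are assumptions; the text only states that a fixed percentage of the historical data is dropped before merging.

```python
import random

def update_training_set(history, new_samples, drop_rate, rng=None):
    """Drop a fixed fraction of the historical pairs, then merge in the new ones.

    drop_rate = 1.0 reduces to simply replacing the training set.
    Which historical pairs are dropped (here: uniformly at random) is an assumption.
    """
    rng = rng or random.Random(0)
    n_keep = int(round(len(history) * (1.0 - drop_rate)))
    kept = rng.sample(list(history), n_keep)
    return kept + list(new_samples)

history = [("y_old", i) for i in range(10)]
latest = [("y_new", i) for i in range(4)]
merged = update_training_set(history, latest, drop_rate=0.5)
replaced = update_training_set(history, latest, drop_rate=1.0)
print(len(merged), len(replaced))
```

If each iteration contributes n new pairs and a fraction r of the history is dropped, the training set size settles at roughly n / r, so the drop rate also bounds the cost of each training step.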
4 Experimental Setup
We verify the proposed method by tackling the acoustic-to-articulatory inversion problem. In this section, we introduce the adopted articulatory synthesizer, model architecture and training procedure. Training setup and evaluation metrics are also provided.
4.1 Articulatory Synthesizer
We adopt the Tube Resonance Model (TRM) (Hill et al., 2017), which uses waveguide techniques to simulate the propagation of sound waves through a tube, as our articulatory synthesizer. Composed of a vocal tract with 8 segments and a nasal cavity with 5 segments, the TRM accepts 26-dimensional utterance-rate parameters and 16-dimensional time-varying control-rate parameters as input to synthesize sound. Utterance-rate parameters specify the global state of the tube, such as tube length, glottal pulse and breathiness. Control-rate parameters dynamically control the tube to produce time-varying sounds by changing the diameters of the segments and the velum, setting micro-intonation, and inserting fricatives or aspiration as needed. For details of the TRM, please refer to (Hill et al., 2017). Hereinafter, we refer to the utterance-rate and control-rate parameters collectively as articulatory parameters.
Figure 1: Overview of an iteration (a) and the model architecture (b). (a) At the sampling step, the output of the network is directly adopted as a sampling result. (b) Strides are specified to conserve sequence lengths: for an input mel-spectrogram of N frames, the output of the last up-conv layer also spans N frames, ignoring the batch and channel dimensions.
4.2 Learning Procedure and Model Configuration
Given speech data and the TRM, we apply the proposed EMSSL framework to learn an inference model of the articulatory parameters through iterations. Each iteration involves a sampling step and a training step, as illustrated in Figure 1(a). At the sampling step, articulatory parameters are sampled and fed to the TRM to synthesize speech. At the training step, the inference model is trained on the generated paired data in a supervised manner. Note that the inference model used in these two steps is exactly the same.
The model takes mel-spectrograms as input and outputs both control-rate and utterance-rate parameters, with the architecture illustrated in Figure 1(b). A U-Net-like (Ronneberger et al., 2015) CNN structure is used to obtain features at different time and frequency scales. The kernel sizes of all convolutional layers are set to 3, and strides are specified to conserve sequence lengths. A bidirectional LSTM (BLSTM) (Graves and Schmidhuber, 2005) layer with hidden size 128 is stacked on top of the convolutional layers. The forward and backward outputs of the BLSTM at every time step are concatenated and mapped to the 16-dimensional control-rate parameters through a linear layer, while the last cell states of the BLSTM are concatenated and mapped through another linear layer to 13-dimensional utterance-rate parameters (the other 13 dimensions are fixed during training; details are given in the supplementary material). ReLU activation is used for all convolutional layers, while tanh activation is used for the linear layers. Both the loss on the utterance-rate parameters and the loss on the control-rate parameters are mean squared errors, and our optimization objective is a weighted combination of the two, where the weight is a hyperparameter.
4.3 Training Details
We use 80-dimensional log-magnitude mel-spectrograms with a 50 ms frame length and a 12.5 ms frame shift, following the practice in (Wang et al., 2017). The inferred articulatory parameters are rescaled before being fed to the TRM, and the control-rate parameters are interpolated to 250 Hz to meet the requirement of the TRM. The number of samples per datapoint is set to 1, so that the deterministic output of the network can be viewed directly as the sampling result. We apply Algorithm 2 to update the training set with a fixed data drop rate.
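A 12.5 ms frame shift corresponds to 80 frames per second, so if the network emits one control vector per frame, the control-rate trajectory has to be upsampled from 80 Hz to 250 Hz before synthesis. The sketch below shows one way to do this with per-dimension linear interpolation. The interpolation method and the function name `resample_controls` are assumptions; the text only states that the parameters are interpolated to 250 Hz.

```python
import numpy as np

def resample_controls(params, src_rate=80.0, dst_rate=250.0):
    """Linearly interpolate a (frames, dims) trajectory from src_rate to dst_rate (Hz)."""
    params = np.asarray(params, dtype=float)
    n_src = params.shape[0]
    n_dst = int(round(n_src / src_rate * dst_rate))
    t_src = np.arange(n_src) / src_rate  # time stamp of each input frame, in seconds
    t_dst = np.arange(n_dst) / dst_rate  # time stamp of each output frame, in seconds
    # np.interp works on 1-D signals, so interpolate each dimension separately;
    # targets beyond the last source time stamp take the last source value.
    columns = [np.interp(t_dst, t_src, params[:, k]) for k in range(params.shape[1])]
    return np.stack(columns, axis=1)

# One second of 16-dimensional control-rate parameters at the mel frame rate.
controls_80hz = np.tile(np.linspace(0.0, 1.0, 80)[:, None], (1, 16))
controls_250hz = resample_controls(controls_80hz)
print(controls_80hz.shape, controls_250hz.shape)
```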
The neural network is randomly initialized and trained with the Adam optimizer under a fixed loss weight and learning rate. Since there is no need to optimize to convergence within each iteration, we train for 10 epochs per iteration. Batch sizes vary across datasets. We train the model on a computer with 2 NVIDIA V100 GPUs and 2 Xeon E5-2690 V4 CPUs.
4.4 Evaluation Metrics
We aim to train a model that infers articulatory parameters from a reference utterance so as to synthesize an utterance similar to the reference. We measure the similarity between the synthetic and reference utterances by the signal-to-noise ratio (SNR) of the synthetic mel-spectrogram: the higher the SNR, the more similar the two utterances are. The SNR is expressed as:
where N denotes the frame number; M denotes the number of bins; and are the jth bin of the ith frame of the reference and synthetic mel-spectrograms, respectively. To check the quality of the synthetic mel-spectrogram at different time stamps and different frequencies, the SNR at each time-frequency point can be used, which can be expressed as:
For clarity, we refer to the utterance-level measure as sentence SNR and to the measure at each time-frequency point as local SNR.
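Both measures can be sketched in a few lines. The code assumes the conventional energy-ratio definition, in which the reference mel-spectrogram is the signal and the difference between reference and synthetic spectrograms is the noise: 10 log10 of the summed squared reference over the summed squared difference for the sentence SNR, and the same ratio at a single time-frequency point for the local SNR. The function names, the small `eps` guard and the optional clipping range are assumptions made for illustration.

```python
import numpy as np

def sentence_snr(ref, syn):
    """SNR in dB over a whole utterance; ref and syn have shape (N frames, M bins)."""
    ref = np.asarray(ref, dtype=float)
    syn = np.asarray(syn, dtype=float)
    signal = np.sum(ref ** 2)
    noise = np.sum((ref - syn) ** 2)
    return 10.0 * np.log10(signal / noise)

def local_snr(ref, syn, eps=1e-12, clip=(-20.0, 60.0)):
    """SNR in dB at every time-frequency point, clipped to a plotting range."""
    ref = np.asarray(ref, dtype=float)
    syn = np.asarray(syn, dtype=float)
    # eps avoids division by zero where the two spectrograms coincide exactly.
    snr = 10.0 * np.log10((ref ** 2 + eps) / ((ref - syn) ** 2 + eps))
    return np.clip(snr, clip[0], clip[1])

ref = np.ones((4, 3))   # toy reference: 4 frames, 3 bins
syn = 0.9 * ref         # synthetic spectrogram with a 10% amplitude error
print(sentence_snr(ref, syn))      # approximately 20 dB
print(local_snr(ref, syn).shape)   # one value per frame and bin
```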
5 Experiments
In this section, we conduct experiments on single-speaker and multi-speaker datasets to demonstrate the convergence property of EMSSL and report model performance with qualitative and quantitative results. We then show that trained models can generalize well to datasets of unseen speakers or even a new language, and the model performance can be further improved through self-adaptation.
5.1 Single-speaker Evaluation
Dataset and Setup
We conduct experiments on the LJSpeech dataset (Ito, 2017), which contains 13,100 English audio clips of a single female speaker, with a total audio length of approximately 24 hours. We randomly choose 5400 clips for training, 540 clips for validation and 540 clips for testing. The experimental setup follows Section 4.3 with a batch size of 180 clips.
The convergence process is shown in Figure 2(a). While the training cost decreases overall, there is always a cost increase at the beginning of each iteration, and the training cost oscillates across iterations. Both effects arise because we update the training set with newly sampled data at the beginning of each iteration. However, our purpose is to improve the similarity between the synthetic and reference utterances rather than to minimize the training cost. Therefore, at the beginning of each iteration, we directly evaluate the model on the training and validation sets by (i) applying the inference model to extract articulatory parameters, (ii) synthesizing utterances, and (iii) calculating the sentence SNR of the synthetic utterances following Equation (3). As illustrated in Figure 2(a), sentence SNR increases steadily with iterations and stabilizes after around 60 iterations. Each iteration takes around 15 minutes to complete. We call this trained model the LJ-model.
We evaluate the trained model on the test set using the mean sentence SNR. A pair of reference and synthetic mel-spectrograms from the test set is shown in Figure 3(a). We also attach several audio samples in the supplementary material for reference. As the samples demonstrate, the LJ-model successfully infers the underlying articulatory parameters of the reference utterances. We also calculate the local SNR on the test set as defined in Equation (4) and compute statistics for every mel bin. As illustrated in Figure 3(b), the reference and synthetic spectrograms are highly similar in detail, with the median local SNR above 20 dB for most bins. The lowest-frequency bins have a lower local SNR because the energy of those bins is rather low.
Figure 3: Test results of the LJ-model. (a) A reference (top) and the corresponding synthetic (bottom) mel-spectrogram; the corresponding text is "During the period the Commission was giving thought to this situation". (b) Local SNR values are truncated to [-20, 60]; middle bars represent medians, boxes represent the interquartile range (IQR; 25th-75th percentile), whiskers extend to 1.5 × IQR beyond the box, and outliers are shown as single points.
5.2 Multi-speaker Evaluation
Dataset and Setup
We conduct multi-speaker experiments on the ARCTIC dataset (Kominek et al., 2003), which has a total audio length of approximately 7 hours. The dataset contains utterances of 7 speakers (5 males and 2 females), with around 1150 clips for each speaker. We first randomly split the utterances of each speaker into training, validation and testing sets with a ratio of 8:1:1, then merge the training sets of different speakers for training, while keeping track of each speaker for validation and testing. The experimental setup follows Section 4.3 with a batch size of 300 clips.
The sentence SNR curves on the validation sets of all speakers are shown in Figure 2(b). Although performance differs between speakers, the quality of the synthetic utterances improves steadily for all of them. We evaluate the model obtained after 60 iterations by its average sentence SNR on the merged test set, and call it the ARCTIC-model.
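The per-speaker split can be sketched as follows: each speaker's clips are shuffled and divided 8:1:1, the training parts are merged, and the validation and test parts stay keyed by speaker. The speaker and clip identifiers, the fixed seed and the function names are hypothetical.

```python
import random

def split_speaker(clips, seed=0):
    """Shuffle one speaker's clips and split them 8:1:1 (integer arithmetic)."""
    clips = list(clips)
    random.Random(seed).shuffle(clips)
    n_train = len(clips) * 8 // 10
    n_val = len(clips) // 10
    return clips[:n_train], clips[n_train:n_train + n_val], clips[n_train + n_val:]

def build_splits(clips_by_speaker):
    """Merge training clips across speakers; keep validation/test sets per speaker."""
    train, val, test = [], {}, {}
    for speaker, clips in clips_by_speaker.items():
        tr, va, te = split_speaker(clips)
        train.extend(tr)
        val[speaker], test[speaker] = va, te
    return train, val, test

# Hypothetical corpus layout: 7 speakers with 1150 clips each.
corpus = {f"spk{k}": [f"spk{k}_{i:04d}" for i in range(1150)] for k in range(7)}
train, val, test = build_splits(corpus)
print(len(train), len(val["spk0"]), len(test["spk0"]))
```

Keeping the validation and test sets separate per speaker is what allows the per-speaker curves and scores to be reported while training on the merged data.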
5.3 Generalizing and Adapting to Unseen Speakers and New Languages
We construct test sets of unseen English speakers from the test-clean set of the LibriTTS corpus (Zen et al., 2019). To ensure enough clips for the adaptation experiments, we filter out speakers with fewer than 160 clips, which leaves 13 speakers (4 males and 9 females). We then randomly choose 4 female speakers along with all 4 male speakers to build separate test sets with 160 clips per speaker. We construct a Mandarin Chinese test set by randomly picking 160 clips from the Chinese set of the CSS10 corpus (Park and Mulc, 2019), which is constructed from two audiobooks read by a female speaker.
We evaluate the trained LJ-model and ARCTIC-model on all 8 unseen English speakers. As shown in Figure 4(a), both models achieve a relatively high sentence SNR (above 15 dB) for most unseen speakers. Meanwhile, the LJ-model does not generalize to male speakers as well as the ARCTIC-model does, but it outperforms the ARCTIC-model on the test sets of female speakers. These differences are consistent with the respective training data. We then evaluate the LJ-model, which is trained on an English dataset, on the Chinese test set. The result indicates that the LJ-model generalizes to Chinese reasonably well, though with some performance loss.
Furthermore, the self-supervised nature of EMSSL makes it possible for a trained model to adapt itself and obtain a better estimate of the underlying articulatory parameters when it meets unseen utterances. We conduct adaptation experiments with the LJ-model on the test set of Speaker 1089 (i.e., the speaker with ID 1089, on which the LJ-model performs worst) and on the Chinese test set. To explore how few unseen utterances suffice for adaptation, we randomly pick 20, 40, 80 and 160 clips from the test sets and fine-tune the LJ-model on the picked clips.
As shown in Figure 4(b) and Figure 4(c), sentence SNR on the picked clips improves continuously as adaptation proceeds, and adaptation stability and model performance generally benefit from more adaptation samples. After adaptation, sentence SNR improves on both the test set of Speaker 1089 and the Chinese test set, which shows that adaptation is an effective way to transfer trained models to unseen data.
6 Conclusion and Future Work
We have proposed a novel approach, EMSSL, to solve inverse problems in a self-supervised manner so as to obtain a physically meaningful representation of the observed data. By integrating a physical forward process, the proposed approach follows an analysis-by-synthesis procedure of iterative sampling and training, which can be viewed as a form of embodied learning. We verify the proposed method by tackling the acoustic-to-articulatory inversion problem. Given an articulatory synthesizer and reference utterances, the model learns from scratch to extract articulatory parameters that synthesize speech very close to the reference utterances. Besides, our experiments demonstrate that a trained model can be transferred to unseen speakers or even a new language through self-adaptation.
Articulatory kinematic information can benefit pronunciation teaching, phonology research, and machine speech recognition and synthesis. Before it can be applied in such fields, the properties of the extracted articulatory information need to be analysed further in future work.
The proposed EMSSL framework can be applied to solve inverse problems in those fields with forward processes, such as simulated or physical robots or manipulators, especially when input parameters are non-trivial to sample and the forward processes are non-differentiable.
Last but not least, due to motor equivalence phenomena (Perrier and Fuchs, 2015), acoustic-to-articulatory inversion suffers from the problem of non-uniqueness, which is also common among other inverse problems. Attention should be paid to such aspects when applying EMSSL to solve inverse problems.
Broader Impact
a) If the proposed self-supervised method is adopted, the resources that would otherwise be spent on obtaining labelled data can be saved. When used to extract articulatory information as in our experiments, the method can benefit the preservation of language culture and second language learning. Specifically, the method can support phonology research and thereby help protect language culture, especially for dialects and minority languages, while in education it can support pronunciation teaching.
b) As far as we are aware, nobody will be put at a disadvantage by this research.
c) The proposed method can check at run time whether it has succeeded or failed, so no external consequences will occur even if the system fails.
d) The proposed method does not leverage biases in the data.
Acknowledgments
The work was supported in part by the National Natural Science Foundation of China (No. 11590773) and the National Social Science Foundation of China (No. 15ZDB111). We also acknowledge the High-Performance Computing Platform of Peking University for providing computational resources.
References
- Adler and Öktem (2018). Deep Bayesian inversion. arXiv preprint arXiv:1811.05910.
- Afshan and Ghosh (2015). Improved subject-independent acoustic-to-articulatory inversion. Speech Communication 66, pp. 1–16.
- Asada (2016). Modeling early vocal development through infant–caregiver interaction: a review. IEEE Transactions on Cognitive and Developmental Systems 8 (2), pp. 128–138.
- Brooks et al. (2019). Unprocessing images for learned raw denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11036–11045.
- Burda et al. (2016). Importance weighted autoencoders. International Conference on Learning Representations.
- Chartier et al. (2018). Encoding of articulatory kinematic trajectories in human speech sensorimotor cortex. Neuron 98 (5), pp. 1042–1054.
- Chen et al. (2020). A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.
- Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
- Fukui et al. (2009). Three dimensional tongue with liquid sealing mechanism for improving resonance on an anthropomorphic talking robot. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5456–5462.
- Gao et al. (2019). Articulatory copy synthesis based on a genetic algorithm. Proc. Interspeech 2019, pp. 3770–3774.
- Graves and Schmidhuber (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18 (5-6), pp. 602–610.
- Higashimoto and Sawada (2002). Speech production by a mechanical model: construction of a vocal tract and its control by neural network. In Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292), Vol. 4, pp. 3858–3863.
- Hill et al. (2017). Low-level articulatory synthesis: a working text-to-speech solution and a linguistic tool. Canadian Journal of Linguistics/Revue canadienne de linguistique 62 (3), pp. 371–410.
- Howard and Messum (2014). Learning to pronounce first words in three languages: an investigation of caregiver and infant behavior using a computational model of an infant. PloS One 9 (10).
- Ito (2017). The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/
- Jiang et al. (2019). Improving transformer-based speech recognition using unsupervised pre-training. arXiv preprint arXiv:1910.09932.
- Kingma and Welling (2014). Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Vol. 1.
- Kolesnikov et al. (2019). Revisiting self-supervised visual representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1920–1929.
- Kominek et al. (2003). CMU ARCTIC databases for speech synthesis.
- Liberman and Mattingly (1985). The motor theory of speech perception revised. Cognition 21 (1), pp. 1–36.
- Liu et al. (2019). Coherent semantic attention for image inpainting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4170–4179.
- Mitra et al. (2017). Joint modeling of articulatory and acoustic spaces for continuous speech recognition tasks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5205–5209.
- Natterer (2001). The mathematics of computerized tomography. SIAM.
- Park and Mulc (2019). CSS10: a collection of single speaker speech datasets for 10 languages. arXiv preprint arXiv:1903.11269.
- Park et al. (2003). Super-resolution image reconstruction: a technical overview. IEEE Signal Processing Magazine 20 (3), pp. 21–36.
- Perrier and Fuchs (2015). Motor equivalence in speech production. The Handbook of Speech Production, pp. 225.
- Ronneberger et al. (2015). U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
- Sønderby et al. (2016). Amortised MAP inference for image super-resolution. arXiv preprint arXiv:1610.04490.
- Wang et al. (2017). Tacotron: towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.
- Xie et al. (2016). Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pp. 478–487.
- Yoshikawa et al. (2003). A constructivist approach to infants’ vowel acquisition through mother–infant interaction. Connection Science 15 (4), pp. 245–258.
- Zen et al. (2019). LibriTTS: a corpus derived from LibriSpeech for text-to-speech. arXiv preprint arXiv:1904.02882.