We present a multilinear statistical model of the human tongue that captures anatomical and tongue pose related shape variations separately. The model is derived from 3D magnetic resonance imaging data of 11 speakers sustaining speech related vocal tract configurations. The extraction is performed by using a minimally supervised method that uses as basis an image segmentation approach and a template fitting technique. Furthermore, it uses image denoising to deal with possibly corrupt data, palate surface information reconstruction to handle palatal tongue contacts, and a bootstrap strategy to refine the obtained shapes. Our evaluation concludes that limiting the degrees of freedom for the anatomical and speech related variations to 5 and 4, respectively, produces a model that can reliably register unknown data while avoiding overfitting effects. Furthermore, we show that it can be used to generate a plausible tongue animation by tracking sparse motion capture data.
In computer vision, analyzing human motion is a research area that has been active for a long time (Wang et al., 2003; Chen et al., 2013). Here, it is of interest to, e.g., track the typical shape changes of the whole body (Vasilescu, 2002; Ning et al., 2004) or of specific parts, such as the face (Turk and Pentland, 1991; Oliver et al., 2000), during human behavior. The acquired information can then be further processed to learn typical patterns of human motion. This derived knowledge is useful for a wide array of applications: for example, it can be used to automatically recognize human activity, which is for instance helpful in ambient-assisted living or surveillance settings. Moreover, this information can be utilized to synthesize these motions, for example in computer graphics to create realistic animations or to imbue virtual avatars with more natural behavior, which in turn could improve human-machine interaction.
In these tracking approaches, it is sometimes beneficial to use prior knowledge in order to constrain the motion to movements that are typical for the observed body part. This constraint may prevent such an approach from falsely detecting unrealistic shape deformations or help it to estimate the full motion from sparse data. Examples of such prior knowledge are statistical models, such as principal component analysis (PCA) models. These models use a low-dimensional subspace to represent the shape of the corresponding body part and are also able to measure the plausibility of such a shape. Such a statistical model requires a labeled database that shows the body part under consideration in many different shapes that are related to the motion to be observed. However, the observed shape differences are often the result of a combination of different factors, such as anatomical and pose related variations. In these situations, a multilinear model might be used to capture those different types of variations separately. Vasilescu (2002) built such a model from motion capture data of several subjects. She used the obtained knowledge to recognize actions and to synthesize new motions. Bolkart and Wuhrer (2015) successfully used a multilinear model to analyze faces in motion.
Another human body part whose motions are of interest is the tongue, which as one of the main human articulators plays an important role in speech production. In speech science, it is therefore of great interest to understand its shape and motion during human articulation.
As mentioned earlier, acquired knowledge about tongue shape and motion can be used to synthesize new motion: for example, it can be used with virtual avatars for multimodal spoken interaction to provide a more natural animation of the tongue. Here, we note that it is of vital importance to synthesize the correct motion for the produced speech: McGurk and MacDonald (1976) found that inconsistencies between visible mouth motions and audible speech may cause the speech to be perceived incorrectly. Moreover, tongue motion synthesis can be applied in computer-aided pronunciation training to provide the user with visual information on how to move the tongue to produce a specific sound (Engwall, 2008).
However, analyzing tongue motion would benefit from access to a statistical model of the tongue. This is due to the sparseness of motion capture data of the tongue acquired, e.g., by electromagnetic articulography (EMA) or real-time MRI.
We notice that the shape of the tongue during speech varies by a combination of at least two factors: on one hand, related to the anatomy of the speaker, and on the other, based on the speech related tongue pose. This observation motivates the usage of a multilinear model.
To derive such a model, we need data about the tongue shape during speech production. However, most of the articulators are contained inside the human mouth and therefore partially or completely hidden from view. This means that traditional imaging modalities based on light, e.g., photography, are of limited use for acquiring the desired shape information. Currently, MRI can be regarded as the state-of-the-art technique for investigating the interior of the human vocal tract during speech. It is non-invasive and non-hazardous to the subject, and in contrast to ultrasound or EMA, it is able to provide dense volumetric measurements of the vocal tract. Moreover, there is work on adapting the MRI method to the needs of speech research: here, advances have been made to improve the measurement time (Kim et al., 2009) and quality of the acquired scans (Lingala et al., 2016).
The measured MRI data only contains raw image data and has to be further processed to extract the desired shape information. In our case, a suitable shape representation is given by a polygon mesh. Such a representation offers the advantage that it can be directly used in different fields. For example, in computer graphics such meshes are used to generate animations of complex objects (Botsch et al., 2010, Ch. 9) or to model objects of highly complex geometry and topology. Furthermore, polygon models have been used in speech processing to generate acoustical simulations (Blandin et al., 2015). More importantly, they have been used to perform a statistical analysis of a class of shapes, such as human bodies (Allen et al., 2003), faces (Blanz and Vetter, 1999), or tongues (Badin and Serrurier, 2006).
Ideally, the extraction process should be at least minimally supervised, as doing it manually takes a lot of time and might require anatomical expertise.
Afterwards, the collection of estimated meshes can be analyzed to derive our desired multilinear model.
In the literature, a lot of research has focused on analyzing the tongue shape during speech production. The work by Engwall (2000), Badin and Serrurier (2006), and Fang et al. (2016) used 3D MRI data of a single speaker to analyze the speech related shape variations by using PCA (or LCA in the case of Engwall). They annotated the contour of the tongue manually in the scan data and used a mesh as shape representation.
There also exists research that aims at analyzing the anatomical and speech related shape variations separately: Harshman et al. (1977) investigated these variations in 2D X-ray data. We note that this image modality is nowadays no longer used for this purpose due to its known negative health effects. Analysis on 2D MRI was conducted by Hoole et al. (2000), Ananthakrishnan et al. (2010), and Vargas et al. (2012b, a). Finally, Zheng et al. (2003) performed this analysis on sparse sets of 65 points that were manually extracted from 3D MRI scans.
On the whole, we see that there are still some open issues: previous work focused only on 2D data or sparse 3D data to analyze the anatomical and speech related variations. This sparse data representation might not be sufficient to capture the whole complex structure of the tongue. Initial work investigating these variations in 3D meshes obtained from MRI data of 9 speakers was presented by Hoole et al. (2003), but neither evaluated nor published. Moreover, work that focused on the speech related shape variations of a denser 3D representation of the tongue required manual annotation of the used MRI data, which makes it infeasible for large collections of data. Here, work exists that aims at facilitating the tongue shape extraction from MRI data. However, such approaches are often limited because they are restricted to 2D (Peng et al., 2010; Eryildirim and Berger, 2011; Raeesy et al., 2013), produce only a low-level volume segmentation (Lee et al., 2013), or require an anatomical expert to provide the tongue templates (Harandi et al., 2014).
In this paper, we present a significant extension of our previous work (Hewer et al., 2014). Its contributions can be summarized as follows:
We propose a minimally supervised framework for extracting tongue meshes from 3D MRI data. It is minimally supervised in the sense that a user only has to annotate a few landmarks in the scan data, which significantly reduces the burden on the user compared to annotating the entire tongue surface. Originally, it combined an image segmentation technique and a template matching approach to achieve that goal. Here, we add an image denoising method to the framework in order to deal with possibly corrupt data. Moreover, we modify the template matching approach to better handle volumetric point clouds. Furthermore, we integrate a strategy for restoring missing tongue surface information that occurs due to contact between hard palate and tongue. This improvement increases the number of tongue shape configurations we can register. Finally, the framework is augmented by making use of a bootstrapping strategy, which refines the quality of the obtained shape meshes.
All these modifications allowed us to register speech related tongue shapes of 11 speakers that we used to derive a multilinear statistical model that captures almost the entire complex 3D surface geometry of the tongue and allows the anatomy and pose related variations to be modified separately.
We examined the obtained model with respect to its compactness, generalization, and specificity properties. In the case of the specificity analysis, we investigated the surface parts of the tongue mesh that play an important role during human articulation. The results of our experiments motivated us to choose a model with 5 degrees of freedom for the anatomy and 4 for the speech related tongue pose. Moreover, we successfully used the obtained model for tracking motion capture data of the tongue.
The remainder of the paper is organized as follows: in the next section, we start by describing how surface information of the vocal tract can be extracted from a given 3D MRI scan by denoising it and applying an image segmentation approach. We proceed by discussing the modified template matching approach in Section 3 and also present the used templates of our approach. Section 4 is dedicated to describing how we estimate a tongue mesh from the surface information by using the template fitting. Here, we present the bootstrapping strategy used and our approach to restore missing tongue surface information that is caused by contact between tongue and hard palate. Next, we turn to the multilinear tongue model in Section 5. In this section, we outline how the acquired mesh collection can be aligned to only contain speech and anatomy related tongue shape variations and how the model is derived. We then turn to the evaluation of our approach in Section 6 where we apply it to MRI scans of two datasets. Furthermore, we conduct experiments to evaluate the compactness, generalization, and specificity properties of the acquired model. In Section 7, we use the model for tracking speech related motion capture data of a new speaker. Finally, we conclude in Section 8 and outline possible future work.
As a first step, we want to extract a point cloud from an MRI scan that contains the surface points and the associated normals of the major articulators and related tissue. We use a purely geometric representation of this surface information because it is easy to combine two point clouds into a single one. This is helpful in situations where we want to restore missing information in a point cloud that is present in another cloud.
As we are using image processing methods, we briefly describe how we treat a volumetric MRI scan as a 3D image. We may represent an MRI scan as a real-valued function on a discrete rectangular domain that consists of the sample positions where the scanner took the measurements. These positions are arranged on a regular grid with a fixed grid spacing along each axis. The function value at a sample position represents the NMR signal measured there, which is correlated with hydrogen molecule density.
This scan can be interpreted as a gray-scale 3D image by applying a quantization operator to the NMR values that scales them to a standard gray-scale. Here, we decided to use a standard visualization where bright and dark indicate a high and low NMR signal, respectively.
Figure 1 shows two typical visualizations of such a representation: a sagittal slice and a coronal slice of the scan image. As in general the field of view (FOV) of the original scan contains much more information than just the vocal tract, we usually crop it to a selected region of interest (ROI). This reduces the memory requirements and the processing time of our framework.
By inspecting the scan, we observe that the data is degraded due to measurement noise. As a remedy, we apply a 3D variant of edge-enhancing diffusion (Weickert, 1998) to the image. An example result of the approach can be inspected in Figure 1. We see that the noise was removed and that structural information such as edges was preserved and enhanced.
We now want to extract a point cloud of the desired surface information from the denoised MRI scan. First, we detect the spatial support of the region whose surface information we want to derive. That is, we want to find a partition of the image domain into two regions such that one contains the major articulators and related tissue and the other contains everything else. By inspecting the denoised data, we notice that tissue can be distinguished from non-tissue, such as air and bone, by using color information. This observation motivates the use of image segmentation methods that make use of such a feature. In our case, we decided to use the method by Otsu (1979) to perform this task as it is fully automatic.
An example segmentation can be seen in Figure 2.
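The thresholding step can be sketched as follows. This is a minimal NumPy illustration of Otsu's method, not the implementation used in the framework; the function names, the 256-bin histogram, and the assumption that tissue forms the brighter class are illustrative choices.

```python
import numpy as np

def otsu_threshold(image, bins=256):
    """Return the gray value that maximizes the between-class variance."""
    values = np.asarray(image, dtype=float).ravel()
    hist, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    weight_lo = np.cumsum(hist)              # voxels at or below each bin
    weight_hi = weight_lo[-1] - weight_lo    # voxels above each bin
    cum_sum = np.cumsum(hist * centers)
    mean_lo = cum_sum / np.maximum(weight_lo, 1)
    mean_hi = (cum_sum[-1] - cum_sum) / np.maximum(weight_hi, 1)
    between = weight_lo * weight_hi * (mean_lo - mean_hi) ** 2
    return centers[np.argmax(between)]

def segment_tissue(image):
    """Boolean mask that is True where a voxel is brighter than the threshold.

    Assumption: tissue is the bright class and air/bone the dark class.
    """
    return np.asarray(image, dtype=float) > otsu_threshold(image)
```

Because the threshold is derived from the histogram alone, no user input is needed, which matches the motivation for choosing a fully automatic method.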
As we are interested in the shape information of the surface, we proceed by extracting the surface points of the tissue from the obtained partition. We call a point of the tissue region a surface point if at least one of its neighbors is part of the non-tissue region. Additionally, we use the partition to estimate normal information at the extracted surface points.
The obtained surface points and associated normals are then assembled in a point cloud. An example of such a point cloud can be inspected in Figure 2.
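A minimal sketch of this extraction step is given below. It assumes a 6-neighborhood, treats everything outside the volume as non-tissue, and estimates normals from the negative gradient of the binary mask; these choices and the name `extract_surface` are assumptions for illustration and may differ from the actual framework.

```python
import numpy as np

def extract_surface(mask, spacing=(1.0, 1.0, 1.0)):
    """Return surface point positions and unit normals of a binary tissue mask."""
    mask = np.asarray(mask, dtype=bool)
    # Pad with non-tissue so that tissue touching the volume border counts as surface.
    padded = np.pad(mask, 1, constant_values=False)
    touches_outside = np.zeros(mask.shape, dtype=bool)
    for axis in range(3):
        for shift in (-1, 1):
            neighbor = np.roll(padded, shift, axis=axis)[1:-1, 1:-1, 1:-1]
            touches_outside |= ~neighbor
    surface = mask & touches_outside
    # The mask gradient points into the tissue, so its negative points outwards.
    gradient = np.stack(np.gradient(mask.astype(float), *spacing), axis=-1)
    normals = -gradient[surface]
    lengths = np.linalg.norm(normals, axis=1, keepdims=True)
    normals = normals / np.maximum(lengths, 1e-12)
    # Convert voxel indices to physical coordinates using the grid spacings.
    points = np.argwhere(surface) * np.asarray(spacing, dtype=float)
    return points, normals
```

For very thin structures the central-difference gradient can vanish, in which case the sketch returns a zero normal; a smoothed mask would be one way to avoid this.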
Next, we want to estimate the surface of the desired articulator from such a point cloud. Here, we use a polygon mesh as the surface representation. It consists of two sets: the vertex set, which contains the vertex positions, and the face set, which describes how the vertices are connected to form the surface.
We observe that a point cloud is a loose collection of points. In general, this collection contains more information than the desired articulator and there might be holes in the cloud with missing data. However, a subset of the cloud implicitly represents the surface of the desired articulator.
In order to identify this subset and estimate the articulator shape from it, we can apply a template fitting technique.
Given a template mesh that resembles the desired articulator and a point cloud, it finds a set of rigid body motions, one for each vertex of the template, such that the deformed mesh obtained by applying each motion to its vertex is near the point cloud data.
The template matching finds this set of deformations by minimizing an energy (Equation 4) that combines a data term, a smoothness term, and a landmark term. The data term is minimized if applying the deformations to the mesh moves it towards some points in the point cloud. The smoothness term penalizes deformations that alter the original shape of the template. Finally, the landmark term produces energy if correspondences between landmarks on the mesh and user-provided points are violated by the deformation.
As Equation 4 is not differentiable, it is usually optimized by minimizing a series of energies. We note that each energy in the series uses adapted weights for the smoothness and landmark terms, which are derived from values set by the user.
Originally, we used a standard heuristic (Allen et al., 2003; Li et al., 2009) to distinguish valid data observations from invalid ones in the optimization of each energy. In particular, we say that a point of the cloud is a valid data point candidate for a deformed vertex if the Euclidean distance between both is not too large and if their normals do not differ too much from each other.
We have modified this nearest neighbor heuristic somewhat: we collect all valid data point candidates within a fixed radius and then select the best candidate that lies below the current mesh surface. If no such candidate exists below the surface, we will select the best one above it. This modification is intended to prevent the template mesh from getting stuck at unrelated points in the volumetric cloud during the optimization. An example showing the benefits of this modified heuristic can be inspected in Figure 3. Here, we note that we are showing the projection of the matched template on the scan data for the sake of visibility and interpretability.
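The modified heuristic can be sketched for a single vertex as follows. The sketch assumes unit-length, outward-pointing normals, interprets "below the surface" as a negative offset along the vertex normal, and takes "best" to mean the candidate with the smallest Euclidean distance; the function name and these interpretations are assumptions.

```python
import numpy as np

def select_candidate(vertex, vertex_normal, points, normals, radius, max_angle_deg=60.0):
    """Return the index of the selected cloud point, or None if no candidate is valid."""
    offsets = points - vertex
    dists = np.linalg.norm(offsets, axis=1)
    cos_angles = normals @ vertex_normal
    # Valid candidates: close enough and with a sufficiently similar normal.
    valid = (dists <= radius) & (cos_angles >= np.cos(np.radians(max_angle_deg)))
    if not valid.any():
        return None
    # Prefer candidates below the current surface; fall back to those above it.
    below = valid & (offsets @ vertex_normal < 0.0)
    pool = below if below.any() else valid
    candidates = np.flatnonzero(pool)
    return int(candidates[np.argmin(dists[candidates])])
```

Preferring candidates below the surface means a nearer point above the surface is ignored whenever a valid point below it exists, which is the behavior intended to keep the template from attaching to unrelated points.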
In our framework, we use two templates: one for the tongue and one for the hard palate. Both templates were extracted from MRI data by means of a medical imaging software. Afterwards, we made the templates symmetric to remove this particular bias towards the original speaker.
The palate template consists of vertices and faces with an average edge length of . The tongue template contains vertices and faces with an average edge length of . We note that the tongue template is missing the sublingual part of the tongue that is negligible for speech production.
Both templates can be inspected in Figure 4 together with the landmarks used.
We first estimate the palate shape for each MRI scan. This shape information is needed in some cases to restore tongue surface information that is missing due to contacts between tongue and palate.
First, we select a scan for each speaker where the hard palate is clearly visible and perform template matching. We note that, in general, using a single template might produce sub-optimal results in some matching cases. In order to improve the results, we set up an iterative bootstrapping approach. In each iteration, we first compute a PCA model of the palate (Hewer et al., 2015) by using the results of the previous iteration. This model is then fitted to each point cloud and the results are afterwards used as the initialization for the template matching.
After we acquired the hard palate mesh for each considered speaker, we want to align this mesh to each scan of the corresponding speaker. This procedure serves the purpose of restoring tongue surface information that is missing due to contacts between tongue and palate as shown in Figure 5.
Here, we have to address the issue that the corresponding speaker might have moved between the scans. Fortunately, as the hard palate can only undergo rigid body transformations, we only have to estimate this type of motion. However, as the palate surface information might be partly missing, we fall back to color information for this task. To this end, we define the color profile set of a mesh in a scan. A profile is a vector whose entries are image values sampled along the normal of a mesh vertex at positions separated by the chosen sampling distance. The sampling starts above the palate surface in order to avoid taking samples in the possible contact area between tongue and palate.
Then, we can estimate the rigid body motion for aligning a palate mesh obtained from one scan to another scan by maximizing an energy that accumulates, over the index set of the vertex set, the normalized cross-correlation (NCC) between the color profile of each vertex in the original scan and the profile of the corresponding vertex of the transformed mesh in the target scan. We decided to use the NCC as a similarity measure because it is known to be robust against noise and brightness differences. Furthermore, the NCC between color profiles was already successfully used in a nearest neighbor heuristic for template matching (Harandi et al., 2014). A result of this alignment approach can be seen in Figure 5.
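To illustrate the similarity measure, the following sketch computes the NCC between two profiles and accumulates it over all vertices. It assumes that the profiles have already been sampled and that the energy is a plain sum of per-vertex NCC values; the names `ncc` and `alignment_energy` are hypothetical.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two profiles; 0.0 if one of them is constant."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(a @ b / denom)

def alignment_energy(profiles_source, profiles_target):
    """Sum of per-vertex NCC values; larger values indicate a better alignment."""
    return sum(ncc(p, q) for p, q in zip(profiles_source, profiles_target))
```

Since the mean is subtracted and the result is normalized, scaling or offsetting the brightness of one profile leaves the NCC unchanged, which is the robustness property referred to above.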
We now inject this aligned palate mesh information into the point cloud of the corresponding scan in order to restore missing tongue surface information by using the palate surface as a replacement. Additionally, we use the aligned mesh as a boundary to remove points in the point cloud above the palate that are unrelated to the tongue. Finally, we use a template matching to extract the tongue shape from the corresponding modified point cloud. As in the palate case, we use a bootstrapping strategy to refine the results.
This time, we use a multilinear model, which is described in the next section, as a statistical prior in each iteration. Effects of this bootstrapping operation can be seen in Figure 6.
Having obtained a collection of tongue meshes, we then want to derive a function (Equation 9) that maps a pair of coordinates to a mesh from a set of tongue meshes. The first coordinate is taken from a set of coordinates that describe a speaker’s anatomical tongue shape. The second one is taken from a set of coordinates that determine the shape for a specific speech related tongue pose. Here, we call the first set the speaker subspace and the second set the pose subspace of the model. The generated meshes should have the same face set as our tongue template mesh. Their vertex sets, however, may differ from the original template with respect to their vertex positions.
Deriving the function in Equation 9 implies we want to analyze only the anatomical and speech related variations in our mesh collection, which means we have to remove all other variations present. The Procrustes alignment technique (Dryden and Mardia, 1998) is a method suitable for this task as it may be used to remove any translational and rotational differences among the meshes in the collection. However, applying this technique directly to the acquired tongue meshes might destroy critical information, e.g., related to the speech related tongue pose. This is, for example, due to the fact that the tongue also undergoes translational and rotational motions because it is connected to the lower jaw.
As a remedy, we apply the Procrustes alignment to the hard palate meshes we obtained earlier to remove translational and rotational differences between the speakers that are unrelated to the tongue motion. The results are afterwards used as a reference to align the tongue meshes. To this end, we use a speaker’s palate mesh that was earlier aligned to the corresponding scan. Here, we then estimate the rigid transformation that maps this aligned palate mesh to its Procrustes variant and apply the same motion to the corresponding tongue mesh. By doing so, we remove any translational and rotational differences related to head motions or position differences without destroying any speech or anatomy specific information.
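The transfer of the palate motion to the tongue mesh can be sketched as below. The rigid transformation is estimated here with the SVD-based least-squares (Kabsch) solution, which is an assumption about how such a motion could be computed; the function names are hypothetical, and vertex correspondence between the two palate meshes is assumed.

```python
import numpy as np

def estimate_rigid_motion(source, target):
    """Least-squares rotation R and translation t such that R @ source_i + t ~ target_i."""
    src_mean = source.mean(axis=0)
    tgt_mean = target.mean(axis=0)
    H = (source - src_mean).T @ (target - tgt_mean)
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections so that R is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0.0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_mean - R @ src_mean
    return R, t

def apply_rigid_motion(vertices, R, t):
    """Apply the rigid motion to an (n, 3) array of vertex positions."""
    return vertices @ R.T + t
```

A usage pattern under these assumptions: estimate `R, t` from the scan-aligned palate mesh to its Procrustes variant, then call `apply_rigid_motion(tongue_vertices, R, t)` so that the tongue mesh inherits exactly the same motion.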
Finally, we have to ensure that for each speaker the meshes for all selected poses are available. Here, we reconstruct a missing pose shape of a speaker by averaging available data: first, we compute the average shape of all meshes that are present for the speaker. Afterwards, we compute the mean shape of all meshes that are available for this specific pose from the other speakers. Finally, both meshes are averaged again. We note that there exist more sophisticated methods to restore missing information, such as HALRTC (Liu et al., 2013). In our case, however, this averaging approach was sufficient.
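The averaging scheme for missing shapes can be written compactly. The sketch below assumes that the serialized meshes are stored in an array of shape (speakers, poses, features) and that missing shapes are marked with NaN; this storage convention and the function name are assumptions.

```python
import numpy as np

def fill_missing_poses(meshes):
    """Fill NaN-marked shapes with the average of the speaker mean and the pose mean."""
    meshes = np.array(meshes, dtype=float)
    missing = np.isnan(meshes).any(axis=2)
    filled = meshes.copy()
    for s, p in zip(*np.nonzero(missing)):
        # Average shape of all meshes that are present for this speaker.
        speaker_mean = meshes[s, ~missing[s]].mean(axis=0)
        # Mean shape of this pose over all speakers for whom it is available.
        pose_mean = meshes[~missing[:, p], p].mean(axis=0)
        filled[s, p] = 0.5 * (speaker_mean + pose_mean)
    return filled
```

The means are always computed from the original data, so the result does not depend on the order in which missing shapes are filled. If a pose is missing for every speaker, the sketch cannot reconstruct it and the entry remains NaN.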
In order to derive our desired function in Equation 9, we need to analyze the anatomical and speech related variations separately. In several works (Harshman et al., 1977; Hoole et al., 2000, 2003; Ananthakrishnan et al., 2010; Vargas et al., 2012b, a; Zheng et al., 2003), the PARAFAC method (Harshman, 1970) was used to perform this analysis. This method, also known as CANDECOMP, decomposes a tensor into a sum of rank-1 tensors whose number is provided by the user. Therefore, this technique can be regarded as an extension of the singular value decomposition to tensors. However, literature reports issues with this method: Hoole et al. (2000) found that it might be difficult to find reliable solutions. Vargas et al. (2012a) pointed out that the PARAFAC decomposition requires numerous components to describe the observed data in a satisfactory way, which limits its usefulness as a dimensionality reduction method. Moreover, De Silva and Lim (2008) discovered that the associated standard approximation problem is mathematically ill-posed, which can lead to the problem of diverging components in a numerical setting.
Another suitable method is the Tucker decomposition (Tucker, 1966) that is sometimes also called higher-order singular value decomposition (HOSVD). This method computes the orthonormal spaces of a tensor associated with its modes. It may be regarded as a more flexible variant of the PARAFAC method (Kiers and Krijnen, 1991) and has previously been used to analyze 2D tongue shape data (Vargas et al., 2012b).
To avoid the issues of PARAFAC, we decided to use the Tucker decomposition to analyze our data. Here, we follow the approach of Bolkart and Wuhrer (2015) who used it to analyze the variations of human faces in different expressions. To this end, we first turn our tongue meshes into feature vectors by serializing their vertex sets. Then, we compute the mean of these vectors and center them. Afterwards, we organize the centered vectors in a third-order tensor. Here, we refer to the first mode of the tensor as the speaker mode, whose dimension is the number of speakers, to the second mode as the pose mode, whose dimension is the number of different tongue poses, and to the third mode as the vertex mode, whose dimension is that of the feature vectors.
The HOSVD makes use of the fact that the data tensor can be decomposed into a core tensor that is multiplied along the speaker mode with a matrix of speaker coordinates and along the pose mode with a matrix of pose coordinates. In our case, the row vectors of the speaker matrix are coordinates in our speaker space that determine the anatomical shape for each of the original speakers. A similar observation applies to the pose matrix, where the row vectors are coordinates in the pose space. The core tensor of the decomposition acts as a link between the two matrices. The operation that multiplies a tensor along its n-th mode with a matrix is called the n-th mode multiplication.
The core tensor is the multilinear model we can use to create our function in Equation 9: essentially, given a speaker coordinate and a pose coordinate, we can generate a serialized vertex set that represents the corresponding shape by multiplying the core tensor with the speaker coordinate along the speaker mode and with the pose coordinate along the pose mode, and by adding the mean afterwards. Our function then returns the mesh that consists of the vertex set reconstructed from this vector and the face set of our original template. We remark that the dimensionality of the speaker and pose subspace can be truncated to remove shape variations that may be considered negligible or related to noise. This means that our subspaces can have dimensionalities that are smaller than the number of speakers and poses in the mesh collection.
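The construction and evaluation of such a model can be sketched with NumPy as follows. This is an illustrative HOSVD-style implementation under the assumption that the serialized meshes are stored in an array of shape (speakers, poses, features); the function names and the truncation interface are not taken from the original framework.

```python
import numpy as np

def build_multilinear_model(shapes, speaker_dims, pose_dims):
    """Return mean, core tensor, and (truncated) speaker and pose coordinate matrices."""
    mean = shapes.mean(axis=(0, 1))
    centered = shapes - mean
    n_speakers, n_poses, _ = centered.shape
    # Orthonormal basis of the speaker mode from the mode-1 unfolding.
    u_speaker, _, _ = np.linalg.svd(centered.reshape(n_speakers, -1), full_matrices=False)
    # Orthonormal basis of the pose mode from the mode-2 unfolding.
    pose_unfolding = centered.transpose(1, 0, 2).reshape(n_poses, -1)
    u_pose, _, _ = np.linalg.svd(pose_unfolding, full_matrices=False)
    u_speaker = u_speaker[:, :speaker_dims]
    u_pose = u_pose[:, :pose_dims]
    # Core tensor: project the data onto both bases.
    core = np.einsum("spf,si,pj->ijf", centered, u_speaker, u_pose)
    return mean, core, u_speaker, u_pose

def generate_shape(mean, core, speaker_coords, pose_coords):
    """Serialized vertex set for the given speaker and pose coordinates."""
    return mean + np.einsum("ijf,i,j->f", core, speaker_coords, pose_coords)
```

Without truncation, passing the row of a training speaker and the row of a training pose to `generate_shape` reproduces the corresponding training shape; with truncation, it yields its low-dimensional approximation.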
We can use this derived model to register data, for example a point cloud. This time, we want to optimize for the speaker and pose parameters that best describe the speaker anatomy and tongue pose that is represented in the data. To this end, we minimize an energy that consists of a data term and a landmark term, which are equivalent in their modeling idea to their counterparts in the template matching case. Furthermore, we use the same nearest neighbor heuristic and optimization approach as in the template matching. This time, the weights for both terms remain constant during the optimization of the energy series. However, we note that if the correct neighbors are known, they can be set directly and only one energy has to be minimized in that case.
It is common to limit the admissible values for the speaker and pose parameters to avoid highly unlikely shapes. In particular, we limit each entry of both parameters individually to an interval that is centered at the corresponding entry of the mean coordinate in the respective subspace. The width of the interval is determined by the standard deviation of the corresponding variation in the used mesh collection, multiplied by a scale factor.
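A minimal sketch of this box constraint is shown below. It assumes a symmetric interval of the form mean plus or minus scale times standard deviation for each entry; the symmetric form and the function name are assumptions.

```python
import numpy as np

def clamp_coordinates(coords, mean, std, scale):
    """Restrict each coordinate entry to [mean - scale * std, mean + scale * std]."""
    lower = mean - scale * std
    upper = mean + scale * std
    return np.clip(coords, lower, upper)
```

In an iterative optimizer, such bounds would typically be handed to a bound-constrained solver rather than applied after the fact, but the clipping form makes the admissible region explicit.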
We note that the above energy can also be used to fit a pca model: in this case, the energy depends only on one parameter.
Our next goal is to apply the described framework to MRI data and evaluate the quality of the obtained tongue model.
The Ultrax dataset consists of static MRI scans of 11 adult speakers of British English, of whom 7 are female and 4 are male. All speakers are phonetically trained and were recorded while sustaining the vocal tract configuration for different phones. For each speaker, 13 speech related scans are available that correspond to the phone set [i, e, E, a, A, 2, O, o, u, 0, @, s, S] (the given notation uses the IPA).
The Baker dataset was recorded as part of the Ultrax project, but released separately. It contains 25 scans of one male speaker that are speech related and depict different articulatory configurations.
The data was recorded at the Clinical Research Imaging Centre in Edinburgh using a Siemens Verio 3T scanner; the scans were acquired with an echo time of and a repetition time of . The individual scans consist of 44 sagittal slices with a thickness of and a slice size of pixels. Here, we have as grid spacings and .
For our analysis, we decided to exclude one speaker of the Ultrax dataset that showed a high activity of the soft palate, which caused problems in our framework. Furthermore, we use the whole phone set that was recorded for the Ultrax data. However, we note that the Baker dataset is lacking scans for the phones [a, O, 0, @, s, S] where the shape information has to be reconstructed.
In total, we are using the shape information of 11 speakers with 13 different tongue shape configurations. This means that we arrive at a tensor where the dimension of the vertex mode is related to the vertex count of the tongue template we are using.
For this data, the following settings were applied in our framework to extract the mesh collection:
In the case of template matching, we used , , , and . Thus, we start with a high weight for the smoothness and landmark terms to drive the template to the correct neighborhood at the beginning of the optimization. The template matching for the tongue used to damp the effects of falsely placed landmarks. We used for the palate matching to ensure that its extremities were correctly aligned. For the model fitting that is applied during the bootstrapping, we used . In the nearest neighbor heuristic, we set the search radius to and limited the maximally allowed angle difference between the normals to 60 degrees. The optimization for the template matching used a series of 40 energies, the one for the model fitting applied a series of 10 energies to find the minimizer. For the palate alignment, we decided to use sufficiently long profiles with a length of and a sampling distance of .
In the bootstrapping strategy, we applied iterations until a satisfactory visual result was obtained: we used one iteration for the hard palate and 5 iterations for the tongue. For the scale factor in the model fitting, we used for the tongue and for the palate in order to prevent overfitting.
The landmarks needed for the hard palate and the tongue were placed on the MRI scans by a user who is not an anatomical expert.
It is common to evaluate such statistical models by analyzing their compactness, generalization, and specificity (Styner et al., 2003) in order to find the optimal subspace dimensionality.
Compactness investigates how much the individual components of and contribute to the description of the used training data. In Figure 8, we see that using is sufficient to represent of data variability. Approximately the same holds for .
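As an illustration of how such a compactness curve can be computed, the sketch below returns the cumulative fraction of data variability captured by the leading components of one mode. It assumes a (speakers, poses, 3 * vertices) array layout and measures variability by the squared singular values of the mode unfolding; the function name `compactness` is hypothetical.

```python
import numpy as np

def compactness(data, mode):
    """Cumulative variability explained by the leading components of a mode.

    data: array of shape (speakers, poses, 3 * vertices); mode 0 selects the
    speaker mode and mode 1 the pose mode (layout assumed for this sketch).
    """
    centered = data - data.mean(axis=(0, 1))
    unfolded = np.moveaxis(centered, mode, 0).reshape(centered.shape[mode], -1)
    singular_values = np.linalg.svd(unfolded, compute_uv=False)
    variance = singular_values ** 2
    return np.cumsum(variance) / variance.sum()
```

The smallest number of components whose cumulative value exceeds a chosen threshold is then a candidate for the subspace dimensionality.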
Generalization measures how well the model can represent data that was not part of the training. To evaluate the speaker generalization, we designed the following experiment: the data of each speaker was once excluded from the training set. The derived model was then used to register this excluded data where we measured the average Euclidean distance between the registered mesh and the original one. Additionally, we analyzed the fitting results for different values of . The dimensionality of the pose subspace was fixed to during these experiments to prevent overfitting caused by this subspace. In the analysis of the pose generalization, we used the same approach. In this case, the dimensionality of the speaker subspace was fixed to . The results of these experiments are depicted in Figure 8. During this evaluation, we used the scale factor in the model fitting optimization.
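The leave-one-speaker-out protocol can be sketched as follows. This is a strongly simplified stand-in for the experiment described above: instead of the energy-based model fitting, it assumes that the held-out meshes are already in vertex correspondence, keeps the pose weights of the training poses fixed, and estimates the speaker weights by linear least squares. All function names are hypothetical.

```python
import numpy as np

def mean_vertex_distance(a, b):
    """Average Euclidean distance between corresponding vertices."""
    return float(np.linalg.norm(a.reshape(-1, 3) - b.reshape(-1, 3), axis=1).mean())

def speaker_generalization(data, n_speaker, n_pose):
    """Leave-one-speaker-out error for data of shape (speakers, poses, 3 * vertices)."""
    errors = []
    for held_out in range(data.shape[0]):
        train = np.delete(data, held_out, axis=0)
        mean = train.mean(axis=(0, 1))
        centered = train - mean
        n_s, n_p, n_v = centered.shape
        u_s = np.linalg.svd(centered.reshape(n_s, n_p * n_v),
                            full_matrices=False)[0][:, :n_speaker]
        u_p = np.linalg.svd(centered.transpose(1, 0, 2).reshape(n_p, n_s * n_v),
                            full_matrices=False)[0][:, :n_pose]
        core = np.einsum("spv,sa,pb->abv", centered, u_s, u_p)
        # basis[p, a, :] is the shape change in pose p caused by speaker weight a.
        basis = np.einsum("abv,pb->pav", core, u_p)
        design = basis.transpose(0, 2, 1).reshape(n_p * n_v, n_speaker)
        target = (data[held_out] - mean).reshape(n_p * n_v)
        weights, *_ = np.linalg.lstsq(design, target, rcond=None)
        fitted = mean + (design @ weights).reshape(n_p, n_v)
        errors.extend(mean_vertex_distance(fitted[p], data[held_out, p])
                      for p in range(n_p))
    return float(np.mean(errors))
```

Calling `speaker_generalization` for increasing values of `n_speaker` with a fixed `n_pose` yields one error value per speaker subspace size, mirroring the structure of the experiment.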
The specificity tries to assess how much the generated tongue shapes of the model differ from the original training data. In particular, we wanted to investigate how large these differences were for the regions of the tongue mesh that are speech related. Figure 7 shows an overview of those regions. To this end, we designed a few experiments where samples from the two subspaces were drawn randomly in order to generate a tongue shape. The first experiment investigated the specificity of the speaker subspace. Here, the pose subspace is again fixed to and the speaker subspace size was varied. For each value of , we generated random tongue shapes and evaluated the average Euclidean distance between the created mesh and the closest one in the mesh collection. In this comparison and distance evaluation, a region consisting of all speech related parts was considered. The same experiment was conducted for analyzing the specificity of the pose subspace where the speaker subspace size was set to . The results of both experiments can be inspected in Figure 8.
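The distance evaluation of the specificity experiments can be illustrated by the following sketch. It assumes that the random tongue shapes have already been generated from the model and that all meshes share the same vertex order; the function name `specificity` and the `region` index array are hypothetical.

```python
import numpy as np

def specificity(samples, training, region):
    """Mean distance of generated shapes to their closest training mesh.

    samples:  (k, n, 3) generated meshes
    training: (m, n, 3) meshes of the training collection
    region:   indices of the vertices that are compared
    """
    s = samples[:, region]   # (k, r, 3)
    t = training[:, region]  # (m, r, 3)
    # Average per-vertex Euclidean distance for every sample/training pair.
    pair_dist = np.linalg.norm(s[:, None] - t[None], axis=-1).mean(axis=-1)  # (k, m)
    # Keep the closest training mesh per sample, then average over the samples.
    return float(pair_dist.min(axis=1).mean())
```

Restricting `training` to the meshes of a single phone gives the fixed phone variant of the experiment.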
Finally, we wanted to find out how much the tongue shapes belonging to specific phones differ from the corresponding ones generated by the model. Here, for each phone we performed the following experiment: we froze the coordinates in the pose subspace to the ones belonging to the given phone. Moreover, we only allowed the generated meshes to be compared to meshes belonging to that phone. Then, for each dimensionality of the speaker subspace, we generated samples and computed the average Euclidean distance to the closest mesh. This time, we used in the distance evaluation and comparison parts of the tongue that are considered critical for this specific phone (Jackson and Singampalli, 2009). For the vowels [i, e, E, a, A, 2, O, o, u, 0, @], we selected a region consisting of the tongue blade, tongue back, and tongue dorsum. The area for the sibilants [s, S] contains the tongue tip and the tongue blade. The results of these experiments are shown in Figure 9.
In all specificity experiments, we generated samples.
The performed experiments provide an interesting insight into the model properties. The results of the generalization experiments show that only a few components of and are needed to reliably register unseen data. In particular, for , 3 components are enough to reach an average error that is slightly above the measurement precision of the MRI scan data. For , 7 components are needed to reach this level of precision. Furthermore, we observe that a high number of components leads to errors below the measurement precision of the scan data, which can be considered as overfitting. Here, we observe that the pose subspace has better generalization abilities than the speaker subspace. We suspect this might be related to redundancies in our training data: for example, the phone pairs [2, O], [e, i], and [e, E] are similar to each other with respect to shape (Ladefoged, 1982). This means that excluding one still provides the model with enough information to capture the related variation.
Moreover, we notice that the phone  shows a notably poor result in the fixed phone specificity evaluation, which might be related to its unusual role in the phonology of British English. We suspect that some speakers might have pronounced it inconsistently and applied different strategies, which led to a high variation in the data that is then integrated into the model.
Overall, we decided that limiting the speaker subspace to 5 dimensions and the pose subspace to 4 dimensions provides a good compromise between specificity, generalization, and compactness. We note that this choice also limits the effects of overfitting.
After having derived our final model, we investigated if it could be used to reliably track the tongue motion capture data of an unknown speaker and to generate a plausible animation from it. To this end, we decided to use EMA data from a previous study (Steiner et al., 2014). In broad terms, EMA uses an electromagnetic field to track the position of coils that are attached to specific points of interest, e.g., on the tongue surface. This modality can provide data with a high temporal resolution, but only gives access to a sparse set of points.
We selected the following data of the female subject VP05 in the dataset: one recording that contains repeated consonant-vowel combinations of the consonants [f, s, S, ç, x, K, m, n, N, l] and the vowel [u]. Furthermore, we used a recording of the German translation of the passage “The North Wind and the Sun”, a standard specimen in phonetic research (International Phonetic Association, 1999). We are thus facing the task of registering dynamic speech data of an unknown speaker that also contains new phonemes.
The raw EMA data was prepared as follows: first, we smoothed the data to dampen any high frequency measurement noise. Afterwards, we removed rigid motion originating from head movements by using the three reference coils of the EMA data that were attached to suitable positions on the head. Finally, we used the palate shape and the bite plane of the subject to rigidly align the data to our tongue model in a semi-supervised way.
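The head motion removal step can be illustrated with a least-squares rigid alignment of the reference coils (the Kabsch algorithm). The sketch below is one common way to do this and is not necessarily the procedure used in this work; it assumes that the first frame defines the head pose, and the function names are hypothetical.

```python
import numpy as np

def rigid_transform(src, dst):
    """Least-squares rotation R and translation t with R @ src_i + t ~ dst_i."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    h = (src - src_c).T @ (dst - dst_c)
    u, _, vt = np.linalg.svd(h)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(vt.T @ u.T))
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = dst_c - r @ src_c
    return r, t

def remove_head_motion(ref_frames, coil_frames):
    """Map all frames into the head pose of the first frame.

    ref_frames:  (T, 3, 3) positions of the three reference coils
    coil_frames: (T, k, 3) positions of the tongue coils
    """
    target = ref_frames[0]
    out = np.empty(coil_frames.shape, dtype=float)
    for i, (ref, coils) in enumerate(zip(ref_frames, coil_frames)):
        r, t = rigid_transform(ref, target)
        out[i] = coils @ r.T + t
    return out
```

Three non-collinear reference coils are sufficient to determine the rotation uniquely, which is why the determinant correction is included.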
For our experiments, we chose the 3 coils that were placed at the tongue tip, the tongue blade, and the tongue dorsum, which lie roughly in the mid-sagittal plane of the tongue. We normalized this data, i.e., we projected the positional data into the mid-sagittal plane to guarantee this mid-sagittal property.
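The projection into the mid-sagittal plane amounts to removing the component of each coil position along the plane normal. The following sketch assumes that the plane is given by a point and a normal vector; both parameters and the function name are hypothetical.

```python
import numpy as np

def project_to_plane(points, plane_point, plane_normal):
    """Orthogonally project (k, 3) points onto the plane through plane_point."""
    normal = plane_normal / np.linalg.norm(plane_normal)
    # Signed distance of every point to the plane along the unit normal.
    offsets = (points - plane_point) @ normal
    return points - np.outer(offsets, normal)
```

For example, if the aligned data uses the x axis as the left-right direction, the plane x = 0 corresponds to `plane_point = (0, 0, 0)` and `plane_normal = (1, 0, 0)`.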
For our tracking experiments, we first had to find for each EMA coil a corresponding vertex on our tongue mesh. We used the following semi-supervised approach to determine these correspondences from one frame of the used EMA data: first, we sample a random tongue shape from our model and find initially for each coil the nearest neighbor on the mid-sagittal area of the tongue mesh. Then, we iteratively refine these correspondences by fitting the model and updating the nearest neighbors. We repeat the above two steps multiple times and keep the correspondences that achieve the smallest average distance between coil positions and their corresponding vertices. Afterwards, we visually compare the proposed correspondences with a photographic reference of the subject’s tongue with the attached coils and rerun the above approach until the correspondences are plausible. During the sampling and the correspondence optimization, we used the scale factor to avoid overfitting.
In our experiments, we used the following energy to fit our model to the current EMA data frame:
The data term measures the distances between the vertex locations and their corresponding EMA coil positions. The bias term penalizes deviations from the mean weights of the model. We added this term to the energy to provide the approach with information about the average tongue shape in order to cope with the sparsity of the data. Finally, the smoothness term favors a temporal coherence between consecutive frames.
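The roles of the three terms can be illustrated with the following sketch. Its exact form is an assumption: it uses squared Euclidean penalties, applies the smoothness term to the model weights of consecutive frames, and combines the terms with two hypothetical factors `bias_weight` and `smooth_weight`. The callable `model_vertices`, which maps model weights to the positions of the vertices that correspond to the coils, is likewise hypothetical.

```python
import numpy as np

def tracking_energy(weights, prev_weights, mean_weights, model_vertices, coils,
                    bias_weight, smooth_weight):
    """Evaluate a data + bias + smoothness energy for one frame (sketch)."""
    vertices = model_vertices(weights)  # (n_coils, 3)
    # Data term: mean squared distance between vertices and coil positions.
    data = np.sum((vertices - coils) ** 2, axis=1).mean()
    # Bias term: deviation from the mean weights of the model.
    bias = np.sum((weights - mean_weights) ** 2)
    # Smoothness term: deviation from the weights of the previous frame.
    smooth = np.sum((weights - prev_weights) ** 2)
    return float(data + bias_weight * bias + smooth_weight * smooth)
```

Such a function can be handed to a generic numerical optimizer frame by frame, using the result of the previous frame both as the starting point and as `prev_weights`.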
For all experiments, we used and to provide a good compromise between these model ideas. Furthermore, we set the scale factor to to give the approach a lot of freedom during the optimization.
In the first experiment, we optimized for and . However, we know that the anatomy of the speaker should remain constant during the recordings. As it is unknown, we used a common approach to estimate the corresponding weights: we averaged the obtained anatomy weights of the first experiment.
For the second experiment, we only optimized for and fixed to the estimated anatomy weights.
For all EMA frames of the results, we computed the distribution of the weights and also the cumulative error. The error in each frame was calculated by measuring the average Euclidean distance between vertex location and corresponding coil position. The error is shown in Figure 10 and the weight distributions can be inspected in Figure 11. Additionally, we created for each experiment a video for the passage “The North Wind and the Sun”. Such a video shows the (anonymized) speaker during recording, an animation of the fitted tongue model together with the actual EMA coil positions, and information about the current weights. Both videos can be found in the supplementary material: the video file corresponding to the first experiment is named “VP05_full.mp4”, the other one is named “VP05_fixed_anatomy.mp4”. Note that we show normalized versions of the model weights, i.e., they are shifted and scaled such that represents the value where and have the same roles as in Equation 14.
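The per-frame error and its cumulative distribution can be computed as in the following sketch, which assumes that the fitted vertex positions and the coil positions are stored as arrays of shape (frames, coils, 3). The function names are hypothetical.

```python
import numpy as np

def frame_errors(vertices, coils):
    """Average Euclidean distance between vertices and coils in every frame."""
    return np.linalg.norm(vertices - coils, axis=-1).mean(axis=1)

def cumulative_error(errors, thresholds):
    """Fraction of frames whose error does not exceed each threshold."""
    ordered = np.sort(errors)
    return np.searchsorted(ordered, thresholds, side="right") / ordered.size
```

Evaluating `cumulative_error` on a dense grid of thresholds gives the curve from which statements of the form "a given percentage of the errors is below a given distance" can be read off.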
We observe in the results for the first experiment that we achieve acceptable errors where of the errors are below . Additionally, we see that all weights are used during the tracking approach, which means that the anatomy is often adapted as well to improve the fitting result. Here, we see that their values approximately stay within the interval . Moreover, we notice that weight 4 of the tongue pose shows a significantly higher variance than the other pose weights.
Moving to the second experiment, we notice that the errors increased. However, they are still acceptable: of the errors are still below . This development can be seen as an expected behavior because the approach now only has 4 degrees of freedom instead of 9 to fit the data. Moreover, we see that the variance of the tongue pose weights increased. We might argue here that optimizing all weights like in the first experiment causes the pose weights to be underestimated. Again, the fourth weight is showing a higher variance than the others. We suspect that this high variance and high range of achieved values might be caused by unknown shape variations in the data.
By inspecting the video material, we notice that in the first experiment the tongue is sometimes visually changing its anatomical properties: for example, it is shrinking or expanding. However, this is an expected behavior because the anatomy weights were also optimized to improve the result. By fixing the anatomy and optimizing only the tongue pose in the second experiment, we seem to avoid these problems: the anatomy of the tongue seems to stay stable and the transitions between frames also appear more plausible.
Thus, we conclude that the second approach produces more acceptable results than the first one despite the slight loss of precision. Moreover, we see that this approach also provides us with information about transition paths in the tongue pose subspace between phonemes. We observe that these obtained transition paths can be used to transfer the tracked motion to another speaker by adjusting the anatomy weights accordingly.
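The motion transfer idea can be sketched as follows: the tracked pose weights of every frame are kept, and only the anatomy weights are exchanged before the shapes are generated. The sketch assumes a core tensor of shape (anatomy weights, pose weights, 3 * vertices) and a mean shape vector; the function name `transfer_motion` is hypothetical.

```python
import numpy as np

def transfer_motion(mean, core, pose_trajectory, target_anatomy):
    """Generate one shape per frame for a different anatomy.

    mean:            (3n,) mean shape of the model
    core:            (n_anatomy, n_pose, 3n) core tensor of the model
    pose_trajectory: (T, n_pose) tracked pose weights
    target_anatomy:  (n_anatomy,) anatomy weights of the target speaker
    """
    return mean + np.einsum("abv,a,tb->tv", core, target_anatomy, pose_trajectory)
```

Because the model is linear in the pose weights once the anatomy is fixed, the transferred animation inherits the temporal behavior of the tracked trajectory.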
In this work, we presented a multilinear tongue model that was derived from volumetric MRI scans in a minimally supervised way. In particular, we saw during the experiments that a model with a low dimensionality can reliably register unknown data with an acceptable precision. Moreover, we explored in two experiments if the model that was acquired from static data was suitable for tracking sparse motion capture data of the tongue. Here, we saw that fixing the anatomical features of the model to a speaker-specific shape provided the most acceptable results. Moreover, this variant also provided us with the option of performing a motion transfer from one speaker to another. However, we also discovered indications that the current multilinear model might be missing some shape variations.
In the future, we plan to investigate whether more shape variations can be obtained using more data. To this end, we want to use additional datasets in our framework. This implies that we also have to extract the shapes of phones like [g, k] that are characterized by contact with the soft palate. Here, we have to address the issue of recovering in the corresponding scans the surface of the soft palate, which can deform in a non-rigid way. Additionally, the datasets we use might differ with respect to the recorded phones, which leads to missing data in our training set. In this case, the simple averaging method for reconstructing missing shapes is no longer sufficient. Furthermore, using more data also increases the risk of encountering mislabeled or corrupt scans.
Moreover, we want to explore whether the derived model can be used to extract realistic 3D tongue motions from real-time 2D MRI data that was recorded in the mid-sagittal plane. In contrast to EMA, this modality provides us with much more motion information. Here, it would also be interesting to investigate different types of dynamic speech, e.g., whispering, shouting, or expressive speech. Ultimately, this could lead to a multilinear model that is able to synthesize these different types of speech.
This study uses data from work supported by EPSRC Healthcare Partnerships (grant number EP/I027696/1). The work itself was funded by the German Research Foundation (grant number EXC 284).