As part of ongoing research in developing a fully data-driven acoustic-visual (AV) text-to-speech (TTS) synthesizer [16], we integrate a tongue model to increase visual intelligibility and naturalness. To extend the kinematic paradigm used for facial animation in the synthesizer to tongue animation, we adapt state-of-the-art techniques of animation with motion-capture data for use with electromagnetic articulography (EMA).
Our AV synthesizer (http://visac.loria.fr/) is based on a non-uniform unit-selection TTS system for French [4], concatenating bimodal units of acoustic and visual data, and extending the selection algorithm with visual target and join costs [13]. The result is an application whose GUI features a “talking head” (i.e. computer-generated face), which is animated synchronously with the synthesized acoustic output.
This synthesizer depends on a speech corpus acquired by tracking marker points painted onto the face of a human speaker, using a stereoscopic high-speed camera array, with simultaneously recorded audio. While the acoustic data is used for waveform concatenation in a conventional unit-selection paradigm, the visual data is post-processed to obtain a dense, animated 3D point cloud representing the speaker’s face. The points are interpreted as the vertices of a mesh, which is then rendered as an animated surface to generate the face of the talking head using a standard vertex animation paradigm.
Due to the nature of the acquisition setup, no intra-oral articulatory motion data can be simultaneously captured. At the very least, any invasive instrumentation, such as EMA wires or transducer coils, would have a negative effect on the speaker’s articulation and hence, the quality of the recorded audio; additional practical issues (e.g. coil detachment) would limit the length of the recording session, and by extension, the size of the speech corpus. As a consequence, the synthesizer’s talking head currently features neither tongue nor teeth, which significantly decreases both the naturalness of its appearance and its visual intelligibility.
To address this shortcoming, we develop an independently animated 3D tongue and teeth model, which will be integrated into the talking head and eventually controlled by interfacing directly with the TTS synthesizer.
2 EMA-based tongue model animation
To maintain the data-driven paradigm of the AV synthesizer, the tongue model consists of a geometric mesh rendered in the GUI along with (or rather, “behind”) the face. (For brevity, the remainder of this paper refers only to a tongue model, but such a model can easily encompass upper and lower teeth in addition to the tongue.) Since the primary purpose of the tongue model is to improve the visual aspects of the synthesizer and it has no influence on the acoustics, there is no requirement for a complex tongue model to calculate the vocal tract transfer function, etc. Therefore, in contrast to previous work [e.g. 14, 5, 8, 6, 17, 10], most of which attempts to predict tongue shape and/or motion by simulating the dynamics in one form or another, we must merely generate realistic tongue kinematics, without having to model the anatomical structure of the human tongue or satisfy physical or biomechanical constraints.
This scenario allows us to make use of standard animation techniques using motion capture data. Specifically, we apply EMA using a Carstens AG500 (Carstens Medizinelektronik GmbH, http://www.articulograph.de/) to obtain high-speed, 3D motion capture data of the tongue during speech [7].
While other modalities might be used to acquire the shape of the tongue while speaking, their respective drawbacks make them ill-suited to our needs. For example, ultrasound tongue imaging tends to require extensive processing to track the mid-sagittal tongue contour and does not usually capture the tongue tip, while real-time MRI has a very low temporal resolution, and is currently possible only in a single slice. (3D cine-MRI of the vocal tract [15], while possible, is far from realistic for the compilation of a full speech corpus sufficient for TTS.)
2.1 Tongue motion capture
Conventional motion capture modalities (as widely used e.g. in the animation industry) normally employ a camera array to track optical markers attached to the face or body of a human actor, producing data in the form of a 3D point cloud sampled over time. For facial animation, these points (given sufficient density) can be directly used as vertices of a mesh representing the surface of the face; this is the vertex animation approach taken in the AV synthesizer (see above).
For articulated body animation, however, the 3D points are normally used as transformation targets for the rigid bones of a hierarchically structured (usually humanoid) skeleton model. Much like the strings controlling a marionette, the skeletal transformations are then applied to a virtual character by deforming its geometric mesh accordingly, a widely used technique known as skeletal animation.
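To make the idea of skeletal animation concrete, the following minimal sketch deforms a single mesh vertex by blending the positions it would take under each bone's transformation, weighted per bone (commonly called linear blend skinning). The function names, the restriction to rotations about the z axis, and the example weights are assumptions chosen for illustration; they do not describe the synthesizer's implementation.

```python
import math

def rotate_z(point, angle, pivot):
    """Rotate a 3D point by `angle` (radians) about a z-parallel axis through `pivot`."""
    x, y, z = (p - q for p, q in zip(point, pivot))
    c, s = math.cos(angle), math.sin(angle)
    return (pivot[0] + c * x - s * y, pivot[1] + s * x + c * y, pivot[2] + z)

def skin_vertex(vertex, bones, weights):
    """Blend the positions the vertex takes under each bone's transform.

    bones:   list of (angle, pivot) pairs, one hypothetical rotation per bone
    weights: per-bone influence on this vertex, expected to sum to 1
    """
    blended = [0.0, 0.0, 0.0]
    for (angle, pivot), weight in zip(bones, weights):
        moved = rotate_z(vertex, angle, pivot)
        for i in range(3):
            blended[i] += weight * moved[i]
    return tuple(blended)

# Two hypothetical bones: one at rest, one rotated by 90 degrees about the origin.
bones = [(0.0, (0.0, 0.0, 0.0)), (math.pi / 2, (0.0, 0.0, 0.0))]
vertex = (1.0, 0.0, 0.0)
only_first = skin_vertex(vertex, bones, [1.0, 0.0])   # follows the resting bone
only_second = skin_vertex(vertex, bones, [0.0, 1.0])  # follows the rotated bone
halfway = skin_vertex(vertex, bones, [0.5, 0.5])      # influenced equally by both
```

With equal weights the vertex lands midway between the two candidate positions, which is closer to the pivot than either of them; this shrinkage is a known limitation of plain linear blending.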
Since current EMA technology allows the tracking of no more than 12 transducer coils (usually significantly fewer on the tongue), the resulting data is too sparse for vertex animation of the tongue surface. For this reason, we adopt a skeletal animation approach, but without enforcing a rigid structure, since the human tongue contains no bones and is extremely deformable. This issue is addressed below.
One advantage of EMA lies in the fact that the data produced is a set of 3D vectors, not points, as the AG500 tracks the orientation, as well as position, of each transducer coil. Thus, the rotational information supplements the positional data and compensates to some degree for its sparseness. Technically, this corresponds to motion capture approaches such as [11], although the geometry is of course quite different for the tongue than for a humanoid skeleton.
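As an illustration of treating each coil as a vector rather than a point, the sketch below pairs a coil position with a unit direction vector derived from two orientation angles. The parameterization by azimuth and elevation in degrees, the axis convention, and the function name `coil_vector` are assumptions made for this example; the actual data layout of a given recording should be checked before reuse.

```python
import math

def coil_vector(x, y, z, azimuth_deg, elevation_deg):
    """Return (position, unit direction) for one coil sample.

    Assumed convention: azimuth is measured in the x-y plane from the x axis,
    elevation is measured from the x-y plane towards the z axis.
    """
    phi = math.radians(azimuth_deg)
    theta = math.radians(elevation_deg)
    direction = (
        math.cos(theta) * math.cos(phi),
        math.cos(theta) * math.sin(phi),
        math.sin(theta),
    )
    return (x, y, z), direction

# A hypothetical coil sample pointing along the y axis.
position, direction = coil_vector(1.0, 2.0, 3.0, 90.0, 0.0)
```

The direction returned has unit length by construction, so it can serve directly as an orientation target for a bone.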
As a small EMA test corpus, we recorded one speaker using the AG500, with the following measurement coil layout: tongue tip center, tongue blade left/right, tongue mid center/left/right, tongue back center, lower incisor, upper lip (reference coils on bridge of nose and behind each ear). The exact arrangement can be seen in Figure 2. The speech material comprises sustained vowels in the set [i, y, u, e, ø, o, @, a], repetitive CV syllables permuting these vowels with the consonants in the set [p, t, k, m, n, N, f, T, s, S, ç, x, l, ], as well as 10 normal sentences each in German and English. A 3D palate trace was also obtained.
We imported the raw EMA data as keyframes into the animation component of a fully-featured, open-source, 3D modeling and animation software suite (Blender v2.5, http://www.blender.org/), using a custom plugin. Unlike point cloud based motion capture data contained in industry standard formats such as C3D [12], this allows us to directly import the rotational data as well. As an example of the result, one frame is displayed in Figure 2. Within each frame of the animation, the EMA coil objects can provide the transformation targets for an arbitrary skeleton.
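The import step can be pictured as turning a sequence of time samples into one keyframe track per coil. The sketch below shows this regrouping in plain Python, independent of any 3D software; the data layout, the coil name, and the assumption of one keyframe per sample are hypothetical and do not reproduce the custom plugin mentioned above.

```python
from collections import defaultdict

def build_keyframes(samples, start_frame=1):
    """Group per-sample coil measurements into per-coil keyframe lists.

    samples: list with one entry per time step; each entry maps a coil name
             to a (location, rotation) pair of 3-tuples.
    Assumes one animation frame per sample (frame rate equal to sample rate).
    """
    tracks = defaultdict(list)
    for offset, sample in enumerate(samples):
        frame = start_frame + offset
        for coil, (location, rotation) in sample.items():
            tracks[coil].append(
                {"frame": frame, "location": location, "rotation": rotation}
            )
    return dict(tracks)

# Two hypothetical samples of a single coil.
samples = [
    {"tongue_tip": ((0.0, 0.0, 0.0), (0.0, 0.0, 0.0))},
    {"tongue_tip": ((1.0, 0.0, 0.0), (0.0, 0.1, 0.0))},
]
tracks = build_keyframes(samples)
```

Keeping location and rotation together in each keyframe mirrors the point made above that the rotational data is imported along with the positions.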
Once the motion capture data has been imported into the 3D software, it can be segmented into distinct actions for use and re-use in non-linear animation (NLA). This allows us to manipulate and concatenate any number of frame sequences as atomic actions, and to synthesize new animations from them, using e.g. the 3D software’s NLA editor (which, for these purposes, is conceptually similar to a gestural score in articulatory phonology [3]).
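The segmentation and concatenation of actions can be sketched as simple operations on lists of keyframes. The function names, the frame-based cutting, and the fixed gap between actions are assumptions for illustration only, and a real NLA editor additionally offers blending and time scaling, which are omitted here.

```python
def segment_action(keyframes, first_frame, last_frame):
    """Cut the frames first_frame..last_frame and rebase them to start at 0."""
    return [
        (frame - first_frame, pose)
        for frame, pose in keyframes
        if first_frame <= frame <= last_frame
    ]

def concatenate_actions(actions, gap=1):
    """Lay actions end to end on a single timeline, `gap` frames apart."""
    timeline = []
    cursor = 0
    for action in actions:
        for frame, pose in action:
            timeline.append((cursor + frame, pose))
        if action:
            cursor += action[-1][0] + gap
    return timeline

# Hypothetical recording of ten frames with placeholder poses.
recording = [(frame, "pose%d" % frame) for frame in range(10)]
action_a = segment_action(recording, 2, 4)
action_b = segment_action(recording, 7, 8)
timeline = concatenate_actions([action_a, action_b, action_a])
```

Because each action is rebased to frame 0, the same action can be reused at several places on the timeline, as `action_a` is here.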
2.2 Tongue model animation
In order to use the tongue motion capture data to control a tongue model using skeletal animation, we design a simple skeleton as a rig for the tongue mesh. This rig consists of a central “spine”, and two branches to allow (potentially asymmetric) lateral movement, such as grooving. Once again, it must be pointed out that this rig is unrelated to real tongue anatomy, although it could be argued that e.g. the spine corresponds roughly to the superior and inferior longitudinal muscles.
Of course, a skeleton of rigid bones is inadequate to mimic the flexibility of a real tongue. Our solution is to construct the rig using deformable bones, so-called B-bones (bendy bones), which can bend, twist, and stretch as required, governed by a set of constraining parameters.
The tongue model should be able to move independently of any specific EMA coil layout, since after all, the motion capture data represents observations of tongue movements based on hidden dynamics. For this reason, and to maintain as much modularity and flexibility as possible in the design, the animation rig is not directly connected to the EMA coils in the motion capture data. Instead, we introduce an adaptation layer in the form of “struts”, each of which is connected to one coil, while the other end serves as a target for the rig’s B-bones. These struts can be adapted to any given EMA coil layout or rig structure.
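The role of the struts as an adaptation layer can be illustrated with a small mapping from coil positions to rig targets. In this sketch each strut is reduced to a fixed translational offset from its coil, which ignores the coil's rotation; the names `strut_targets`, `spine_end`, and `tongue_tip`, as well as the offset values, are hypothetical.

```python
def strut_targets(coil_positions, struts):
    """Compute rig target positions from coil positions.

    coil_positions: maps a coil name to its (x, y, z) position in one frame
    struts:         maps a rig target name to (coil name, offset 3-tuple)
    Simplification: the offset is applied as a pure translation.
    """
    targets = {}
    for target_name, (coil_name, offset) in struts.items():
        position = coil_positions[coil_name]
        targets[target_name] = tuple(p + o for p, o in zip(position, offset))
    return targets

# One hypothetical coil and one strut reaching 2 units below it.
coil_positions = {"tongue_tip": (10.0, 0.0, 5.0)}
struts = {"spine_end": ("tongue_tip", (0.0, 0.0, -2.0))}
targets = strut_targets(coil_positions, struts)
```

Adapting to a different coil layout then only requires a different `struts` table, while the rig and its target names stay unchanged.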
With the struts in place and constrained to the movements of the EMA coils, the rig can be animated by using inverse kinematics (IK) to determine the location, rotation, and deformation of each B-bone for any given frame. The IK is augmented by volume constraints, which inhibit potential “bloating” of the rig during B-bone stretching.
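One common way to express such a volume constraint is to shrink a bone's cross-section as it is stretched along its length, so that the product of the three scale factors stays at 1. The sketch below shows this relation; it is a generic formulation assumed for illustration and not necessarily the exact constraint evaluated by the 3D software.

```python
import math

def volume_preserving_scale(rest_length, stretched_length):
    """Return (length scale, width scale, depth scale) with constant volume.

    Stretching by a factor s along the bone is compensated by scaling both
    perpendicular axes by 1 / sqrt(s), so that s * (1/sqrt(s))**2 == 1.
    """
    if rest_length <= 0 or stretched_length <= 0:
        raise ValueError("lengths must be positive")
    stretch = stretched_length / rest_length
    cross = 1.0 / math.sqrt(stretch)
    return stretch, cross, cross

# A hypothetical bone stretched to four times its rest length.
scales = volume_preserving_scale(2.0, 8.0)
```

Compression works the same way in reverse: a stretch factor below 1 yields cross-section factors above 1.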
The final component for tongue model animation is a mesh that represents the tongue surface, which is rendered in the GUI and deformed according to the skeletal animation. While this tongue mesh could be an arbitrary geometric structure, we use an isosurface extracted from a volumetric scan in an MRI speech corpus (from a different speaker). The tongue in this scan was manually segmented using a graphics tablet and open-source medical imaging software (OsiriX v3.9, http://www.osirix-viewer.com/).
The resulting tongue mesh was manually registered to the EMA coil positions in a neutral bind pose. The skeletal rig was then embedded, and vertex groups in the tongue mesh assigned automatically to each B-bone. As the motion capture data animates the rig using IK, the tongue mesh is deformed accordingly, approximately following the EMA coils. Figure 3 displays the initial and final frame in one cycle of repetitive [ta] articulation in the EMA test corpus. In an informal evaluation, our technique appears to produce satisfactory results, and encourages us to pursue and refine this approach to tongue model animation.
3 Discussion and Outlook
We have presented a technique to animate a kinematic tongue model, based on volumetric vocal tract MRI data, using skeletal animation with a flexible rig, controlled by motion capture data acquired with EMA, and implemented with off-the-shelf, open-source software. While this approach appears promising, it is still under development, and there are various issues which must be addressed before the tongue model can be integrated into our AV TTS synthesizer as intended.
Upper and lower teeth can be added to the model using the same data and animation technique, albeit with a conventional, rigid skeleton. These will then be rendered in the synthesizer’s GUI along with the face and tongue.
The tongue mesh used here is quite rough, and registration with the EMA data does not produce the best fit, owing to differences between the speakers’ vocal tract geometries and articulatory target positions, quite possibly exacerbated by the effects of supine posture during MRI scanning [e.g. 9]. A more suitable mesh might be obtained by scanning the tongue of the same speaker used for the EMA motion capture data, at a higher resolution.
Registration of the tongue mesh into the 3D space of the tongue model should be performed in a partially or fully automatic way, using landmarks available in both MRI and EMA modalities [cf. 1], such as the 3D palate trace and/or high-contrast markers at the positions of the reference coils.
The reliability of EMA position and orientation data is sometimes unpredictable. This could be due to the algorithms used to process the raw amplitude data, faulty hardware, interference (even within the coil layout itself), or any combination of such factors. However, since any such errors become immediately visible as implausible deformations in the animation of the tongue model, we are working on methods both to clean the EMA data itself, and to make the tongue model less susceptible to such outlier trajectory segments.
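One generic way to suppress isolated outliers in a single coil coordinate is a sliding median filter, sketched below. This is a standard signal-processing technique shown purely as an illustration; it is not the cleaning method under development, and the window size and example values are arbitrary assumptions.

```python
from statistics import median

def median_filter(values, window=5):
    """Replace each sample by the median of its neighbourhood.

    The window is truncated at the sequence boundaries, so the output has
    the same length as the input.
    """
    half = window // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]

# A hypothetical trajectory with one implausible spike at index 3.
trajectory = [1.0, 1.1, 1.2, 9.0, 1.4, 1.5, 1.6]
smoothed = median_filter(trajectory, window=3)
```

A median filter removes short spikes without smearing them into neighbouring samples, but it cannot repair longer faulty segments, which would need a different treatment.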
To evaluate the performance of the animation technique, factors such as skin deformation and distance of EMA coils from the tongue model surface should be monitored. The structure of the skeletal rig can be independently refined, optimizing its ability to generate realistic tongue poses. Its embedding into the tongue mesh should preferably be performed using a robust automatic method [e.g. 2].
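A simple proxy for the distance between a coil and the tongue model surface is the distance to the nearest mesh vertex, as in the sketch below. Approximating the surface by its vertices is an assumption that overestimates the true distance on coarse meshes; the function name and the example coordinates are hypothetical.

```python
import math

def coil_surface_distance(coil, vertices):
    """Distance from a coil position to the nearest mesh vertex."""
    return min(math.dist(coil, vertex) for vertex in vertices)

# A hypothetical coil one unit above the nearest of two mesh vertices.
vertices = [(0.0, 0.0, 0.0), (3.0, 4.0, 1.0)]
distance = coil_surface_distance((0.0, 0.0, 1.0), vertices)
```

Tracking this value per coil and per frame would show where and when the deformed mesh drifts away from the measured coil positions.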
The 3D palate trace can be used to add a palate surface mesh to the tongue model. For both the palate and the teeth, the model could also be augmented with automatic collision detection by accessing the 3D software’s integrated physics engine (Bullet physics library, http://www.bulletphysics.com/).
For an interactive application such as the AV synthesizer GUI, it is impractical to incur the performance overhead of an elaborate 3D rendering engine, especially when a non-trivial processing load is required for the bimodal unit-selection. Instead, we anticipate integrating the tongue model into the talking head using a more lightweight, real-time capable 3D game engine, which may even offload the visual computation to dedicated graphics hardware. The advantage of using keyframe-based NLA actions is that they can be ported into such engines as animated 3D models, using common interchange formats (for instance, COLLADA, http://www.collada.org/, or OGRE, http://www.ogre3d.org/). Although the skeletal rig could be accessed and manipulated directly, this “pre-packaging” of animation actions also avoids the complexity, or perhaps even unavailability, of advanced features such as B-bones or IK in those engines.
The final integration challenge is to interface the tongue model directly with the TTS system to synthesize the correct animation actions with appropriate timings. This task might be accomplished using a diphone synthesis style approach, or even action unit-selection, and will be addressed in the near future.
We owe our thanks to Sébastien Demange for assistance during the recording of the EMA test corpus, and to Korin Richmond for providing the means to record the MRI data used here.
[1] Aron, M., A. Toutios, M.-O. Berger, E. Kerrien, B. Wrobel-Dautcourt and Y. Laprie: Registration of multimodal data for estimating the parameters of an articulatory model. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4489–4492, Taipei, Taiwan, April 2009. IEEE.
[2] Baran, I. and J. Popović: Automatic rigging and animation of 3D characters. In Proc. 34th International Conference and Exhibition on Computer Graphics and Interactive Techniques (SIGGRAPH), San Diego, CA, USA, August 2007. ACM.
[3] Browman, C. and L. Goldstein: Articulatory Phonology: An overview. Phonetica, 49(3-4):155–180, 1992.
[4] Colotte, V. and R. Beaufort: Linguistic features weighting for a text-to-speech system without prosody model. In Proc. Interspeech, pp. 2549–2552, Lisbon, Portugal, September 2005. ISCA.
[5] Engwall, O.: Combining MRI, EMA & EPG measurements in a three-dimensional tongue model. Speech Communication, 41(2-3):303–329, October 2003.
[6] Gerard, J.-M., P. Perrier and Y. Payan: 3D biomechanical tongue modeling to study speech production. In Harrington, J. and M. Tabain (eds.): Speech Production: Models, Phonetic Processes, and Techniques, chap. 10, pp. 149–164. Psychology Press, New York, NY, May 2006.
[7] Kaburagi, T., K. Wakamiya and M. Honda: Three-dimensional electromagnetic articulography: A measurement principle. Journal of the Acoustical Society of America, 118(1):428–443, July 2005.
[8] King, S. A. and R. E. Parent: Creating speech-synchronized animation. IEEE Transactions on Visualization and Computer Graphics, 11(3):341–352, May/June 2005.
[9] Kitamura, T., H. Takemoto, K. Honda, Y. Shimada, I. Fujimoto, Y. Syakudo, S. Masaki, K. Kuroda, N. Oku-Uchi and M. Senda: Difference in vocal tract shape between upright and supine postures: Observations by an open-type MRI scanner. Acoustical Science and Technology, 26(5):465–468, 2005.
[10] Lu, X. B., W. Thorpe, K. Foster and P. Hunter: From experiments to articulatory motion – a three dimensional talking head model. In Proc. Interspeech, pp. 64–67, Brighton, UK, September 2009. ISCA.
[11] Molet, T., R. Boulic and D. Thalmann: Human motion capture driven by orientation measurements. Presence: Teleoperators and Virtual Environments, 8(2):187–203, April 1999.
[12] Motion Lab Systems: The C3D File Format User Guide. Baton Rouge, LA, USA, 2008.
[13] Musti, U., V. Colotte, A. Toutios and S. Ouni: Introducing visual target cost within an acoustic-visual unit-selection speech synthesizer. In Proc. 10th International Conference on Auditory-Visual Speech Processing (AVSP), Volterra, Italy, September 2011. ISCA.
[14] Pelachaud, C., C. van Overveld and C. Seah: Modeling and animating the human tongue during speech production. In Proc. Computer Animation, pp. 40–49, Geneva, Switzerland, May 1994. IEEE.
[15] Takemoto, H., K. Honda, S. Masaki, Y. Shimada and I. Fujimoto: Measurement of temporal changes in vocal tract area function from 3D cine-MRI data. Journal of the Acoustical Society of America, 119(2):1037–1049, February 2006.
[16] Toutios, A., U. Musti, S. Ouni, V. Colotte, B. Wrobel-Dautcourt and M.-O. Berger: Towards a true acoustic-visual speech synthesis. In Proc. 9th International Conference on Auditory-Visual Speech Processing (AVSP), pp. POS1–8, Hakone, Kanagawa, Japan, September 2010. ISCA.
[17] Vogt, F., J. E. Lloyd, S. Buchaillard, P. Perrier, M. Chabanas, Y. Payan and S. S. Fels: An efficient biomechanical tongue model for speech research. In Proc. 7th International Seminar on Speech Production (ISSP), pp. 51–58, Ubatuba, Brazil, December 2006.