Speech Map: A Statistical Multimodal Atlas of 4D Tongue Motion During Speech from Tagged and Cine MR Images

by Jonghye Woo, et al.
Harvard University

Quantitative measurement of functional and anatomical traits of 4D tongue motion in the course of speech or other lingual behaviors remains a major challenge in scientific research and clinical applications. Here, we introduce a statistical multimodal atlas of 4D tongue motion using healthy subjects that enables a combined quantitative characterization of tongue motion in a reference anatomical configuration. This atlas framework, termed speech map, combines cine- and tagged-MRI in order to provide both the anatomic reference and motion information during speech. Our approach involves a series of steps including (1) construction of a common reference anatomical configuration from cine-MRI, (2) motion estimation from tagged-MRI, (3) transformation of the motion estimations to the reference anatomical configuration, (4) correction of potential time mismatch across subjects, and (5) computation of motion quantities such as Lagrangian strain. Using this framework, the anatomic configuration of the tongue appears motionless, while the motion fields and associated strain measurements change over the time course of speech. In addition, to form a succinct representation of the high-dimensional and complex motion fields, principal component analysis is carried out to characterize the central tendencies and variations of motion fields of our speech tasks. Our proposed method provides a platform to quantitatively and objectively explain the differences and variability of tongue motion by illuminating internal motion and strain that have so far been intractable. The findings are used to understand how tongue function for speech is limited by abnormal internal motion and strain in glossectomy patients.




1 Introduction

The human tongue is a muscular hydrostat [1] and is considered to be a complex biomechanical system comprised of numerous intrinsic and extrinsic muscles. In the course of speech or other lingual behaviors, the human tongue takes on a variety of positions, shapes, and local deformations created by the complex interactions of its inter-digitated muscles [2]. Understanding the interactions of these components is crucial for many applications, such as speech production or swallowing, and often requires anatomical and motion information ranging from voxel level to muscle level; integrative models of tongue anatomy and physiology are important for understanding the mechanisms of speech production as well as disease and planning intervention. Visualization and quantification of tongue motion during speech or swallowing through medical imaging such as ultrasound or magnetic resonance imaging (MRI) have been used for past decades, yet combined quantitative measurement of functional and anatomical features in a common reference space still remains an unmet goal.

Figure 1: Example of speech data (“ə-suk”) from a healthy control in (a) and a glossectomy patient in (b) at four representative time frames. Cine- and tagged-MRI are shown in the first and second rows, respectively. Note that both modalities are in the same spatio-temporal coordinate space.

Numerous studies have been conducted to date to understand the function of the tongue. Research interest in tongue motion during speech or swallowing using MRI has grown in recent years, as the technology has the potential to elucidate mechanisms of tongue motion related disorders. In particular, it is critical to obtain 4D (3D space with time) motion information about speech movements to understand and model the speech production process. Multimodal MR imaging and subsequent image analysis have allowed us to investigate the multifaceted nature of tongue structure and function. For instance, structural MRI and diffusion MRI [3, 4] provide an exquisite depiction of internal muscular architecture and local fiber orientations, respectively. In addition, real-time MRI provides the ability to examine real-time changes in vocal tract shaping during speech production [5, 6]. Furthermore, tagged-MRI [7, 8] has allowed us to track the internal motion of the tongue in addition to motion on its surface. For example, Fig. 1 shows speech data from cine- and tagged-MRI for a normal control and a glossectomy patient, with different motion patterns at different time points during speech. However, it is challenging to contrast and compare the two subjects by visual assessment alone. Meaningful quantitative measurements derived from these imaging techniques are therefore needed to compare controls and diseased populations. In addition, the usefulness of these methodologies depends on the development of meaningful spatio-temporal indices of tongue function and of techniques for integrating complementary information across these imaging modalities.

Although MRI has played a pivotal role in tongue image and motion analysis, advancement in speech science research is hampered by the lack of tools for accurate, reproducible, and automated quantitative characterization of tongue motion during speech in a common reference space, which is needed to examine tongue motion and its variability. This is partly because the size, shape, and motion pattern of the tongue during speech vary from one subject to another, yet there is no comprehensive and systematic framework to examine the differences and variability in a common reference space. For instance, most of the work using tagged-MRI has analyzed subject-specific internal tongue motion patterns, and it is therefore difficult to objectively and quantitatively compare different motion patterns, especially patient motion patterns. Despite its potential for analyzing internal tongue motion data in an objective manner, there has been very little research toward an atlas of 4D tongue motion during speech. Two key works related to the development of such a 4D atlas are [9] and [10]. Wedeen et al. [9] introduced a new data acquisition strategy within a Lagrangian framework to analyze myocardial strain rates of the heart. That work was performed on a subject-by-subject basis to analyze motion patterns using strain rates. Woo et al. [10] proposed a 4D atlas of the tongue during speech from cine-MRI using diffeomorphic groupwise registration. In that work, only cine-MRI was used to generate the average motion pattern of the tongue during speech. The present work is motivated by both approaches in the sense that the average speech movements of healthy speakers are constructed in a single anatomical configuration, the common reference space. We use both cine- and tagged-MRI to characterize motion quantities during speech in an objective and quantitative manner.

The Speech Map uses a fixed anatomical reference space built from cine-MRI in which speech motion patterns from tagged-MRI are characterized. Tagged-MRI data from a specific speaker are mapped to the atlas reference space, which provides vocal tract tissue motion patterns and structural boundaries. The individual speaker’s motion patterns are then registered and analyzed in that space, thus allowing comparisons of motion fields and strains between subjects with different properties.

In this work, we present a novel approach to combining imaging information by providing a sequence of images in which the tongue anatomical configuration appears frozen [9], but in which each voxel location displays the displacement and associated motion quantities of a fixed tissue element in a common reference space. This approach allows us to quantitatively assess differences in speech movements across speakers and speech-impaired populations. In contrast to the previous methods of tongue imaging in spatial coordinates [10, 12], our approach provides movies of the tongue in material coordinates. In addition, each subject’s tongue motion is transformed to the common anatomical configuration, thereby allowing us to objectively compare patients to the atlas and to each other. To the best of our knowledge, this is the first attempt at constructing a statistical multimodal atlas of 4D tongue motion during speech using both cine- and tagged-MRI within a Lagrangian framework. Using this framework, we demonstrate that accurate characterization of 4D tongue motion during speech is possible, thereby establishing the normal motion patterns and associated quantitative measures in a reference configuration. We also provide data demonstrating the application of the technique for understanding the mechanisms of tongue impairment in an individual’s abnormal speech due to glossectomy.

The remainder of this paper is structured as follows. In Section 2, prior work on 4D multimodal atlas construction is reviewed. The atlas building method for the tongue during speech is presented in Section 3. In Section 4, we describe experimental results. A detailed discussion is presented in Section 5 and finally conclusions and future directions are given in Section 6.

2 Related Work

In this section, we review multimodal 4D atlas construction methods. 4D atlas construction has been an area of active research in recent years. The ability to construct a representative 4D atlas of a population is an important tool in the analysis and interpretation of medical images of organs such as the heart, the fetal brain, the tongue, and the lung. 4D atlases capture changes in anatomical references or particular features of an object over time. 4D atlases become 4D statistical or probabilistic atlases when they represent the differences within a population of subjects [10]. 4D atlases have numerous applications in medical image analysis. For example, 4D atlases representing normal growth or motion patterns can be used to detect abnormalities or potential disease by measuring the variation of a subject relative to the variations contained in the atlas. In addition, they can provide a priori information for the segmentation [11] and registration of anatomical structures. Compared with static 3D multimodal atlas construction methods or 4D atlas construction methods that use a single modality, the multimodal 4D atlas construction problem is challenging because nonlinear mappings across the different modalities and motion modeling need to be performed cooperatively, either sequentially or jointly.

Several approaches have been proposed to construct such 4D atlases. Xing et al. [12] proposed a 4D multimodal atlas of the tongue during speech within an Eulerian framework. In the present work, we also used both cine- and tagged-MRI, but adopted a Lagrangian framework, where the reference anatomic configuration remains fixed, while motion fields change in the course of speech. Puyol-Anton et al. [13] presented a multimodal cardiac motion atlas construction method, in which both MRI and ultrasound were used to construct the atlas. In that work, high-quality tagged-MRI data were first used to form an atlas, and patient data from ultrasound were related to the tagged-MRI based atlas. The approach for embedding the displacement vector was based on principal component analysis (PCA), which could be improved by using nonlinear manifold learning methods. Furthermore, a 4D statistical atlas construction method was presented to build a swine heart atlas from PET-CT images [14]. In that work, the data were acquired from a hybrid PET-CT scanner and spatially co-registered PET and CT data were assumed. A hierarchical normalization method was then used to progressively construct the atlas from anatomic images (i.e., CT angiography) to functional images (i.e., PET). In the present work, both cine- and tagged-MRI are in the same spatio-temporal coordinate system and therefore the nonlinear mappings learned in one modality can be used to map the other modality. In related developments, Wang et al. [15] proposed a joint segmentation and registration method to model 4D changes in pathological anatomy across time by providing an explicit mapping of a healthy normative template. In that work, because a normative template cannot deal with pathological appearance for the joint segmentation and registration, they used different options for initialization via supervised, semi-supervised, and transfer learning approaches for the application of traumatic brain injury.

3 Materials and Methods

3.1 Data Acquisition

3.1.1 Data Collection and Preprocessing

In our study, speech MRI data were collected from fourteen healthy subjects and two glossectomy patients (native speakers). Subjects repeated a pre-trained speech task (i.e., “ə-suk”) lasting 1 second while cine- and tagged-MR images were acquired as a sequence of image frames at multiple parallel slice locations covering a region of interest encompassing the tongue and the surrounding structures. We used a segmented k-space data acquisition scheme, which is not real-time but requires a number of repetitions [17] to obtain a good representation of tongue motion associated with the speech task. We acquired T2-weighted multi-slice 2D dynamic cine- and tagged-MRI data at a frame rate of 26 frames per second using a Siemens 3.0 T Tim Trio system (Siemens Medical Solutions, Malvern, PA) with a 12-channel head coil and a 4-channel neck coil. To avoid the blurring caused by involuntary motion such as swallowing, three orthogonal stacks in axial, coronal, and sagittal orientations were acquired to cover the whole tongue. Each dataset had 6 mm slice thickness and 1.8 mm in-plane resolution. Other sequence parameters included repetition time (TR) 36 ms, echo time (TE) 1.47 ms, flip angle 6°, and turbo factor 11. Super-resolution volume reconstruction was then used to create a single volume with an isotropic resolution by combining all three stacks [18]. For tagged-MRI, the acquisition matrix was the same as for cine-MRI and complementary spatial modulation of magnetization (CSPAMM) was applied. The datasets had 26 frames per second with a temporal resolution of 36 ms for each phase with no delay from the tagging pulse, 6 mm thick slices (12 mm sinusoidal tag period), and 1.875 mm in-plane resolution with no gap. The field-of-view was 24 cm. Note that both cine- and tagged-MRI were in the same spatio-temporal coordinate system.

3.1.2 Subjects and Speech task

The atlas is constructed from a database of cine- and tagged-MR images of fourteen healthy native subjects. The sample population includes both males and females with ages ranging from 21 to 57. Table 1 lists the age, weight, and gender of the subjects included in the atlas construction. The MRI speech task is the phrase “ə-suk.” The word begins with a neutral tongue position (schwa). The tongue body motion is simple because it moves only forward or backward, and the phrase uses little to no jaw motion, thus increasing tongue deformation. There are four distinctive time frames in that phrase: /ə/, /s/, /u/, and /k/.

Subject  Age  Gender  Weight (lb)
1        23   M       155
2        31   F       150
3        24   F       100
4        57   F       170
5        43   F       217
6        35   M       210
7        45   F       180
8        21   F       126
9        37   M       150
10       22   M       130
11       43   M       180
12       26   M       240
13       42   F       180
14       52   M       156
Table 1: Detailed characteristics of the fourteen healthy subjects

Figure 2: Flowchart of the proposed Speech Map

3.2 4D Multimodal Atlas Construction Method

This section introduces our method to construct the 4D multimodal atlas using both cine- and tagged-MRI. Our approach involves a series of steps illustrated in Fig. 2.

3.2.1 Construction of Reference Atlas from Cine-MRI

The first step is to create an atlas of the reference time frame (i.e., time frame 1) from cine-MRI, which serves as a reference anatomical configuration. For $N$ images, given a set of points in the common coordinates of the tongue in the undeformed material time frame, $\Omega_T \subset \mathbb{R}^3$, and a set of points in each image space $\Omega_n \subset \mathbb{R}^3$, $n = 1, \ldots, N$, the goal of the atlas construction is to find a set of diffeomorphic mappings $\psi_n$, each of which transforms any point in each individual image space to a corresponding point in the atlas image $T$: $\psi_n : \Omega_n \rightarrow \Omega_T$, $n \in [1, N]$. We use an unbiased groupwise diffeomorphic registration approach with a cross-correlation similarity metric [20]. The atlas creation procedure combines a groupwise affine registration as an initial transformation, followed by a groupwise deformable registration. The atlas building process yields both the forward (i.e., $\psi_n$) and inverse (i.e., $\psi_n^{-1}$) mappings.

3.2.2 Motion Estimation from Tagged-MRI

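The phase vector incompressible registration algorithm (PVIRA) [19] is used for 3D motion estimation from tagged-MRI. In brief, PVIRA computes a dense 3D motion field at each time frame via the following steps. It interpolates 2D tagged slices into 3D volumes and then uses a harmonic phase (HARP) filter [21] to produce a series of phase volumes from the interpolated 3D volumes. The phase volume pairs from the axial, sagittal, and coronal directions are then processed via an incompressible iterative image registration framework [22]. We denote these phase volumes by $\Phi_a^{t_1}$, $\Phi_s^{t_1}$, $\Phi_c^{t_1}$, $\Phi_a^{t_2}$, $\Phi_s^{t_2}$, and $\Phi_c^{t_2}$, where $a$, $s$, and $c$ stand for the axial, sagittal, and coronal tag directions and the two time frames are $t_1$ and $t_2$. The symmetric velocity field update of PVIRA, defined at each voxel $\mathbf{x}$ of the three volumes, is

$$\delta\mathbf{v}(\mathbf{x}) = \frac{\mathbf{v}_0(\mathbf{x})}{\alpha_1(\mathbf{x}) + \alpha_2(\mathbf{x})/K},$$

where $K$ is the normalization factor and $\mathbf{v}_0$, $\alpha_1$, and $\alpha_2$ are given by

$$\mathbf{v}_0 = \sum_{k \in \{a, s, c\}} W\left(\Phi_k^{t_1} - \Phi_k^{t_2} \circ \varphi\right)\left(\nabla^{*}\Phi_k^{t_1} + \nabla^{*}(\Phi_k^{t_2} \circ \varphi)\right),$$
$$\alpha_1 = \sum_{k \in \{a, s, c\}} \left\|\nabla^{*}\Phi_k^{t_1} + \nabla^{*}(\Phi_k^{t_2} \circ \varphi)\right\|^2, \qquad \alpha_2 = \sum_{k \in \{a, s, c\}} W\left(\Phi_k^{t_1} - \Phi_k^{t_2} \circ \varphi\right)^2,$$

with $\varphi$ the current estimate of the deformation. Note that the wrapping operator $W$ and the modified gradient operator $\nabla^{*}$ used here are given by

$$W(\theta) = \mathrm{mod}(\theta + \pi, 2\pi) - \pi,$$

$$\nabla^{*}\theta = \begin{cases} \nabla\theta, & \text{if } |\nabla\theta| \le |\nabla W(\theta + \pi)|, \\ \nabla W(\theta + \pi), & \text{otherwise}. \end{cases}$$

The forward and inverse deformation fields that are incompressible and diffeomorphic are given by

$$\varphi = \exp(\mathbf{v}), \qquad \varphi^{-1} = \exp(-\mathbf{v}).$$

In PVIRA, incompressibility is enforced only where tissue is present, as defined by a combined HARP magnitude image computed from the three HARP magnitude images that are obtained together with the HARP phase images. An inverse motion field is produced along with the forward motion field, enabling both an Eulerian and a Lagrangian output between the undeformed static tongue and its deformed state. For a full description of the motion estimation pipeline, readers are referred to [8].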
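
As a small illustration of why a wrapping operator is needed when comparing HARP phase values, the sketch below wraps phase differences into the interval [-π, π). It assumes the usual modulo-based definition of phase wrapping; the sample phase values are hypothetical and the code is not taken from the PVIRA implementation.

```python
import numpy as np

def wrap(theta):
    """Wrap phase values (radians) into the interval [-pi, pi)."""
    return np.mod(theta + np.pi, 2.0 * np.pi) - np.pi

# Hypothetical phase samples at two time frames; the first two pairs
# straddle the +/-pi branch cut of the phase image.
phase_t1 = np.array([3.0, -3.0, 0.5])
phase_t2 = np.array([-3.0, 3.0, 0.7])

naive_diff = phase_t1 - phase_t2           # jumps by roughly 2*pi at the branch cut
wrapped_diff = wrap(phase_t1 - phase_t2)   # small differences that reflect true motion

print(naive_diff)
print(wrapped_diff)
```

The naive difference of the first pair is 6.0 rad, whereas the wrapped difference is 6 - 2π, which is the small phase change a registration force should act on.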

Figure 3: Schematic illustration of the transformation of the motion fields derived in the subject space to the atlas space.

3.2.3 Transformation of Motion Estimation

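The PVIRA motion fields derived from tagged-MRI are diffeomorphic mappings defined in the coordinate system of subject $n$: $\varphi_n : \Omega_n \rightarrow \Omega_n$. The transformation of the subject-specific motion fields into the atlas coordinate system is given by [23]

$$\tilde{\varphi}_n(\mathbf{x}) = \psi_n \circ \varphi_n \circ \psi_n^{-1}(\mathbf{x}), \quad \mathbf{x} \in \Omega, \tag{6}$$

where $\psi_n$ and $\psi_n^{-1}$ are the forward and inverse subject-to-atlas mappings learned during the reference atlas construction, $\Omega$ is a subset of voxels including the tongue region in the atlas space, and $\tilde{\varphi}_n$ denotes the transformed motion field of subject $n$. Fig. 3 illustrates this transformation process, where two subject-specific diffeomorphic motion fields indicated by blue and red arrows are transformed using the diffeomorphic deformation fields learned during the reference atlas construction process.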
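
The idea of carrying a subject-space motion field into the atlas space can be sketched in one dimension: an atlas point is pulled back to the subject space, moved by the subject's motion, and pushed forward again. The mappings below (`subject_to_atlas`, `subject_motion`) are hypothetical analytic stand-ins chosen for illustration, not the registration outputs used in the paper.

```python
import numpy as np

def subject_to_atlas(x):
    """Hypothetical subject-to-atlas mapping: the subject tongue is 20% smaller."""
    return 1.25 * x

def atlas_to_subject(y):
    """Inverse of the hypothetical subject-to-atlas mapping."""
    return y / 1.25

def subject_motion(x):
    """Hypothetical invertible motion defined in subject coordinates."""
    return x + 0.1 * np.sin(np.pi * x)

def atlas_motion(y):
    """Subject motion expressed in atlas coordinates by composing the three maps."""
    return subject_to_atlas(subject_motion(atlas_to_subject(y)))

atlas_points = np.linspace(0.0, 1.25, 6)
subject_points = atlas_to_subject(atlas_points)

disp_subject = subject_motion(subject_points) - subject_points
disp_atlas = atlas_motion(atlas_points) - atlas_points
print(disp_subject)
print(disp_atlas)
```

Because the stand-in atlas mapping is a pure scaling, the atlas-space displacement is simply the subject-space displacement scaled by 1.25; with a nonlinear mapping the composition would also redistribute the displacement spatially.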

3.2.4 Computation of Motion Quantities

Lagrangian strain tensors are computed from the PVIRA motion fields in the subject coordinate system or the atlas coordinate system to reflect the local deformation of the tongue tissue at every time frame. We denote the estimated motion field by $\mathbf{u}(\mathbf{X})$, where $\mathbf{X}$ denotes the coordinates of the tongue in the undeformed material time frame. The deformation gradient tensor can be computed as $\mathbf{F} = \mathbf{I} + \nabla_{\mathbf{X}}\mathbf{u}$. The Lagrangian strain tensor is defined as

$$\mathbf{E} = \frac{1}{2}\left(\mathbf{F}^{T}\mathbf{F} - \mathbf{I}\right),$$

which is used in the undeformed material frame to evaluate how much a given displacement differs locally from a rigid body displacement. The eigen-decomposition of the Lagrangian strain yields three principal directions and values (E1, E2, and E3; E1 ≥ E2 ≥ E3) that indicate the directions and amount of the major extension, the second major deformation (either stretching or compression), and the major compression, respectively. In addition, we use the mean of the magnitude of the motion field, which is defined as

$$\bar{m}(t) = \frac{1}{|\Omega_B|}\sum_{\mathbf{X} \in \Omega_B} \|\mathbf{u}(\mathbf{X}, t)\|, \tag{11}$$

where $\|\mathbf{u}(\mathbf{X}, t)\|$ denotes the magnitude of the motion field at material point $\mathbf{X}$ and time frame $t$, and $\Omega_B$ is the bounding box encompassing the tongue region, which is defined in the reference time frame.
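
A minimal NumPy sketch of these motion quantities is given below. It assumes a displacement field sampled on a regular grid in the undeformed frame, the standard Green-Lagrange strain definition, and finite differences via `np.gradient`; the function names and the synthetic stretch field are illustrative only.

```python
import numpy as np

def lagrangian_strain(u, spacing=(1.0, 1.0, 1.0)):
    """Green-Lagrange strain from a displacement field u of shape (3, nx, ny, nz)."""
    grad = np.empty(u.shape[1:] + (3, 3))
    for i in range(3):
        gi = np.gradient(u[i], *spacing)       # derivatives of u_i along X_1, X_2, X_3
        for j in range(3):
            grad[..., i, j] = gi[j]
    F = np.eye(3) + grad                       # deformation gradient at every voxel
    return 0.5 * (np.swapaxes(F, -1, -2) @ F - np.eye(3))

def principal_strains(E):
    """Eigenvalues of E at every voxel, sorted so that E1 >= E2 >= E3."""
    return np.linalg.eigvalsh(E)[..., ::-1]

def mean_motion_magnitude(u, box):
    """Mean displacement magnitude inside a bounding box given as a tuple of slices."""
    return np.linalg.norm(u, axis=0)[box].mean()

# Synthetic test field: 10% stretch along X_1, 5% compression along X_2.
X, Y, Z = np.meshgrid(np.arange(4.0), np.arange(4.0), np.arange(4.0), indexing="ij")
u = np.stack([0.1 * X, -0.05 * Y, np.zeros_like(Z)])

E = lagrangian_strain(u)
strains = principal_strains(E)
box = (slice(1, 3), slice(1, 3), slice(0, 4))
print(strains[0, 0, 0], mean_motion_magnitude(u, box))
```

For this homogeneous field the deformation gradient is diag(1.1, 0.95, 1), so the principal strains are 0.105, 0, and -0.04875 at every voxel; a rigid rotation would give zero strain everywhere.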


3.2.5 Statistical Analysis Using PCA

We perform PCA on the Lagrangian motion fields in the four distinctive time frames, /ə/, /s/, /u/, and /k/, across subjects. The goal of our statistical motion model using PCA is to form a succinct representation of the high-dimensional motion data by building a model of a class of internal motion patterns given a set of examples of the internal motion patterns during speech. All computations are performed in the atlas coordinate system. PCA approximates each vectorized motion field $\mathbf{u}_n$ using a linear model of a form similar to [10]:

$$\mathbf{u}_n \approx \bar{\mathbf{u}} + \mathbf{P}\mathbf{b},$$

where $\bar{\mathbf{u}}$ is the mean of the Lagrangian motion fields for all $N$ subjects,

$$\bar{\mathbf{u}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{u}_n,$$

$\mathbf{b}$ is the parameter vector, and $\mathbf{u}_n$ is the Lagrangian motion field of subject $n$ in the atlas space. The columns of the matrix $\mathbf{P}$ are formed by the principal components (PCs) of the covariance matrix

$$\mathbf{C} = \frac{1}{N-1}\sum_{n=1}^{N}(\mathbf{u}_n - \bar{\mathbf{u}})(\mathbf{u}_n - \bar{\mathbf{u}})^{T}.$$
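
The PCA model can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: motion fields are flattened into rows of a matrix, the covariance is normalized by N - 1, the PCs are obtained from an SVD of the centered data, and random numbers stand in for real motion fields.

```python
import numpy as np

def motion_pca(fields):
    """PCA of flattened motion fields.

    fields: array of shape (N, D), one flattened atlas-space motion field per subject.
    Returns the mean field, the PCs as columns, their variances, and variance ratios.
    """
    mean = fields.mean(axis=0)
    centered = fields - mean
    # The SVD of the centered data avoids forming the D x D covariance matrix.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variances = s**2 / (fields.shape[0] - 1)   # assumed N - 1 normalization
    return mean, vt.T, variances, variances / variances.sum()

# Synthetic stand-in: 14 subjects, 100 voxels with 3 displacement components each.
rng = np.random.default_rng(0)
fields = rng.normal(size=(14, 300))

mean, pcs, variances, ratio = motion_pca(fields)
b = pcs[:, :3].T @ (fields[0] - mean)          # parameter vector for the first subject
approx = mean + pcs[:, :3] @ b                 # three-PC approximation of that field
print(100.0 * ratio[:3])
```

The printed values are the percentage of variance carried by the first three PCs of the synthetic data; with real motion fields these correspond to the percentages reported per time frame.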


4 Experiments and Results

In this section, we present results of experiments on the in vivo tongue data, which demonstrate the performance of the proposed method. All programs were implemented using either C++ or MATLAB. In our experiments, we primarily focused on the analysis of the four representative time frames of the sounds /ə/, /s/, /u/, and /k/.

Figure 4: Plot of the reference atlas using the first time frame from cine-MRI and our final atlas motion fields from “ə-suk.” Note that motion fields are rooted in the material coordinates in the reference atlas space.

Fig. 4 depicts the reference atlas constructed using cine-MRI, which serves as a reference anatomical configuration, and the average motion fields of “ə-suk.” The motion fields of each time frame are not biased by any specific subject’s anatomic features, thereby allowing us to characterize and compare the motion fields directly at each voxel. Each time frame describes representative motion fields of the classes of each phoneme. In the present work, we used manually picked time frames for the time alignment step to ensure accurate quantitative analysis.

4.1 Tongue motion and strain analysis in the subject space

Figure 5: Subject-specific Lagrangian speech motion fields (first row) are shown with the third Lagrangian strain direction and magnitude (second row) of a healthy subject (top) and a glossectomy patient (bottom) at four representative time frames. The third row shows the fixed anatomic configuration of each subject. The third Lagrangian strain in the second row indicates compression of each tissue point. Note that the cone size and color indicate magnitude and direction (red for left-right, green for front-back, and blue for up-down), and all the analyses are performed in the subject space.

The motion fields and the directions and magnitudes of the third Lagrangian strain (E3), which indicates compression (2D mid-sagittal slice only), are depicted in Fig. 5 for the four time frames for a healthy control and a glossectomy patient. Note that in this Lagrangian framework, the tongue configuration appears frozen in the course of speech, as shown in Fig. 5 (bottom row, cine MR image), but each voxel location depicts the speech movements and associated strain of a fixed tissue element. For the normal control in Fig. 5 (left), in the production of /ə/, the genioglossus muscle is compressed slightly more than other regions, as indicated by the blue color. In the productions of /s/ and /u/, the tip, front, and root of the tongue are compressed as indicated by the blue color, while in the production of /k/, the back of the tongue is markedly compressed. Although we have full 3D plus time Lagrangian strains and motion fields, we only show a 2D mid-sagittal slice and therefore the interpretation could be limited. For the glossectomy patient in Fig. 5 (right), the motion differs from that of the normal control, and the compressed tongue regions are rather unpredictable due to the tongue resection. For instance, by visual assessment, a twisting motion pattern is observed in /s/, while most of the tongue regions, the tip, and the floor of the tongue are compressed in /u/ and /k/, as reflected in the third Lagrangian strain (E3).

Time frame  E1             E2              E3
/ə/         0.113 ± 0.057  -0.003 ± 0.005  -0.089 ± 0.032
/s/         0.183 ± 0.080  -0.003 ± 0.007  -0.100 ± 0.035
/u/         0.249 ± 0.120   0.008 ± 0.017  -0.110 ± 0.044
/k/         0.251 ± 0.122   0.009 ± 0.015  -0.114 ± 0.044
Table 2: Lagrangian strains averaged over the whole tongue for normal controls in the atlas space (mean ± SD)

Figure 6: Lagrangian speech motion fields (first row) in the atlas space are shown with the third Lagrangian strain direction (second row) and magnitude (third row) of a healthy subject (top) and a glossectomy patient (bottom) at the four representative time frames. The third row shows the fixed anatomic configuration of the atlas. The third Lagrangian strain in the second row indicates compression of each tissue point. Note that the cone size and color indicate magnitude and direction (red for left-right, green for front-back, and blue for up-down).

4.2 Tongue motion and strain analysis in the atlas space

Figure 7: Plots of the three Lagrangian strain values from the glossectomy patient versus the mean of the three Lagrangian strain values from the normal controls in the atlas space. Note that the values are derived from the whole tongue. The values from the glossectomy patient are higher than the mean of the values from the fourteen normal controls.

Fig. 6 shows the same two subjects used in Fig. 5 transformed to the atlas space, where the motion fields, directions, and magnitudes of the third Lagrangian strain (E3) are depicted in the first, second, and third rows, respectively. The directions and patterns of the third Lagrangian strain (second and third rows) are similar between the subject and atlas spaces, and all the Lagrangian strain values in the two spaces are highly correlated (r=0.99, p=NS), indicating that the transformations used in Eq. (6) preserve the properties of strain in the subject space. Since our proposed Lagrangian framework uses the same material coordinates, it is easy to compare and contrast the differences of strains across subjects. In addition, Table 2 lists the mean and standard deviation of the Lagrangian strains (E1, E2, and E3) averaged over the whole tongue for all normal controls in the atlas space. Fig. 7 depicts the three Lagrangian strain values of the glossectomy patient in Fig. 6 (right) versus the mean of the normal controls in Table 2. All three Lagrangian strains of this patient are elevated compared to those of the normal controls, suggesting that this patient requires more tissue compression or expansion throughout the whole tongue to produce target sounds.

4.3 PCA analysis on motion fields

PCA was performed on the speech motion data following the transformation of the motion fields (i.e., PVIRA) of each subject to the atlas space based on cine-MRI. For all four time frames, the mean motion fields appear, when visually assessed, to be reasonable representative motion fields of the classes of each phoneme, as shown in Fig. 8. Our PCA results indicate that there are no dominant tongue behaviors in the atlas space as reflected by the variance of the different PCs, where the first three PCs for all the time frames accounted for less than 45% of the variance, as shown in Table 3. This is partly because the PCA done in the atlas space captures the differences and variability in speech motion after the size and shape differences of the tongue across subjects were corrected to minimize the effect of anatomical differences. The PCA results thus capture subtle yet objective differences in lingual motion across speakers. First, in the production of /ə/, the first PC, which accounts for 19.53%, seems to capture forward and backward motion in the front part of the tongue, while the second PC, which accounts for 14.03%, roughly captures downward and backward motion and the third PC, which accounts for 9.47%, captures the rotating motion. Second, in the production of /s/, the first PC, which accounts for 16.51%, appears to differentiate between apical /s/ (-1) and laminal /s/ (+1), in which apical /s/ is produced with the tip of the tongue, while laminal /s/ is produced with the blade of the tongue. The second and third PCs, which account for 13.08% and 9.95%, capture forward and upward motion for the second PC, and upward and forward motion in the lower half of the tongue for the third PC, respectively. Third, in the productions of /u/ and /k/, the first PC, which accounts for 17.68% and 17.95%, captures the different use of the tongue tip and blade as in /s/. The second PC, which accounts for 12.21% and 12.42%, explains the use of the front and back of the tongue, while the third PC, which accounts for 11.58% and 11.05%, contrasts upward (-1) and forward (+1) motion for /u/ and upward (-1) and forward (+1) motion for /k/, respectively.

Figure 8: The three primary PCs of variance of the motion fields from “ə-suk.”

Time frame  PC1    PC2    PC3
/ə/         19.53  14.03  9.47
/s/         16.51  13.08  9.95
/u/         17.68  12.21  11.58
/k/         17.95  12.42  11.05
Table 3: PC loadings for the four time frames (%)
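
As a quick arithmetic check of the statement that the first three PCs together account for less than 45% of the variance, the snippet below sums the three percentages listed in Table 3 for each time frame. The dictionary keys are labels chosen for this illustration.

```python
# Percent variance of the first three PCs, copied from Table 3.
variance_pct = {
    "schwa": [19.53, 14.03, 9.47],
    "/s/": [16.51, 13.08, 9.95],
    "/u/": [17.68, 12.21, 11.58],
    "/k/": [17.95, 12.42, 11.05],
}

# Cumulative variance of the first three PCs for each time frame.
cumulative_pct = {frame: round(sum(values), 2) for frame, values in variance_pct.items()}

for frame, total in cumulative_pct.items():
    print(f"{frame}: {total:.2f}%")
```

The cumulative values are 43.03%, 39.54%, 41.47%, and 41.42%, all below 45%.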

5 Discussion

5.1 Summary of Results

In this work, we present a novel approach to visualizing and analyzing tongue motion during speech by constructing a statistical multimodal atlas of 4D tongue motion within a Lagrangian framework. Integrative models of tongue anatomy and physiology using multimodal tongue imaging play an important role in characterizing tissue function and properties. Our atlas framework, based on a normal population, is advantageous for studying the mechanisms of speech production in normal and diseased populations, including tongue cancer and brain disorders such as amyotrophic lateral sclerosis (ALS) or stroke. We report several new findings that would have been difficult to obtain with existing methods. First, unlike other approaches (e.g., [10, 12]), within a Lagrangian framework, we show for the first time the directions and magnitudes of the Lagrangian strain and internal motion patterns on a subject-by-subject basis and further in the atlas space. In the subject-specific material coordinate system, it is possible to compare internal motion and strain patterns across different sounds, while in the atlas material coordinate system, it is possible to compare the internal motion and strain patterns across subjects in an objective and quantitative manner. In this way, we can create a motion map that is not biased by a specific individual’s anatomical and functional features. Second, this approach allows us to capture motion variability by providing a fixed coordinate space for motion analysis using PCA. For instance, PCA on /s/ production captures the difference between apical and laminal /s/. Third, in studying patients, our approach can be used to compare abnormal behaviors in relation to normal motor variation using internal motion fields and strain in the atlas space. This atlas space allows us to compare motion information ranging from voxel level to muscle level. As shown in Table 2, we established average strain values in the whole tongue against which strain values from patient data can be compared.
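The PCA step summarized above can be illustrated with a minimal sketch. It assumes that each subject's motion field has already been warped into the atlas space and stored as an array of shape (subjects, voxels, 3); that layout, the function name `pca_motion_fields`, and the synthetic data are assumptions for illustration and are not taken from the pipeline used in this work.

```python
import numpy as np

def pca_motion_fields(fields, n_components=3):
    """PCA over subjects of motion fields already warped to atlas space.

    fields: array of shape (n_subjects, n_voxels, 3), one displacement
            vector per atlas voxel (hypothetical layout).
    """
    n_subjects = fields.shape[0]
    X = fields.reshape(n_subjects, -1)      # one row per subject
    mean_field = X.mean(axis=0)             # central tendency
    Xc = X - mean_field                     # deviations from the mean
    # Thin SVD avoids forming the large voxel-by-voxel covariance matrix.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    variance = S**2 / (n_subjects - 1)
    explained_pct = 100.0 * variance / variance.sum()
    scores = U * S                          # per-subject weights on each PC
    return (mean_field, Vt[:n_components],
            explained_pct[:n_components], scores[:, :n_components])

# Synthetic stand-in data: 20 "subjects", 500 voxels, 3 components each.
rng = np.random.default_rng(0)
fields = rng.normal(size=(20, 500, 3))
mean_field, modes, explained_pct, scores = pca_motion_fields(fields)
```

Here `modes` holds the principal directions of variation as flattened vector fields, and `explained_pct` gives the percentage of variance carried by each, which is one plausible reading of the percentages reported per time frame in Table 3.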

5.2 A comparison of Lagrangian and Eulerian frameworks

In the present work, we chose to use a Lagrangian configuration rooted in the material coordinates. Since the material coordinate system does not deform in any subsequent time frame and all quantities are mapped back to this static configuration, the process significantly simplifies comparison of strain patterns during speech across subjects. Therefore, a “motionless” concept [9] is established, in the sense that all motion is expressed as changing values of specific variables.

Another strategy is to apply similar processes to an Eulerian configuration rooted in the spatial coordinates, as reported in [12]. Such a “moving atlas” allows for visualization of the moving tongue anatomy parallel to the cine-MRI in the form of changing motion fields during speech. The strength of such an Eulerian atlas is that it reflects the real deforming properties of the tongue shape and motion fields, since the shape changes are also accounted for in the coordinate changes, which can then be directly applied to the deformation and comparison of cine-MRI. In addition, strain and other properties at a specific time instance can be directly computed without additional deformation. However, temporal analysis of any changing variable may become more difficult because the coordinates of the whole tongue keep changing, and inverse tracking of any fixed tissue point is not feasible.
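To make the Lagrangian quantities in this subsection concrete, the following sketch computes a strain tensor from a displacement field defined on the fixed reference grid. It assumes the standard Green-Lagrange definition, E = (FᵀF − I)/2 with F = I + ∂u/∂X; the function name, array layout, and finite-difference gradient are illustrative assumptions rather than the implementation used in this work.

```python
import numpy as np

def lagrangian_strain(u, spacing=(1.0, 1.0, 1.0)):
    """Green-Lagrange strain from a displacement field on the reference grid.

    u: array of shape (nx, ny, nz, 3); u[..., i] is the i-th displacement
       component at each material point (hypothetical layout).
    Returns E with shape (nx, ny, nz, 3, 3).
    """
    grad = np.empty(u.shape[:3] + (3, 3))
    for i in range(3):
        # np.gradient returns [d/dX, d/dY, d/dZ] of component i.
        derivs = np.gradient(u[..., i], *spacing)
        for j in range(3):
            grad[..., i, j] = derivs[j]
    F = grad + np.eye(3)                     # deformation gradient
    Ft = np.swapaxes(F, -1, -2)
    return 0.5 * (Ft @ F - np.eye(3))        # E = (F^T F - I) / 2

# Synthetic check: a uniform 10% stretch along the first axis.
x = np.arange(8, dtype=float)
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
u = np.zeros(X.shape + (3,))
u[..., 0] = 0.1 * X
E = lagrangian_strain(u)                     # E[..., 0, 0] is 0.105 everywhere
```

Because every tensor is indexed by the undeformed grid, strain maps from different time frames share the same voxel locations, which is what makes frame-to-frame and subject-to-subject comparison straightforward in the Lagrangian setting.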

5.3 Time alignment of speech movements across subjects

In the present work, we focused on the analysis of four time frames, which were manually selected for each subject. In order to study the whole phrase, however, it is necessary to accurately align the speech task across subjects to build a temporally aligned 4D atlas. This is a challenging task because there is high inter-subject variability in speech movements even after training each subject to speak to a metronome. This could be more problematic when building 4D atlases using real-time MRI such as [5, 6], since it is difficult to control an individual speaker’s tempo, as opposed to the data collection strategy using repeated utterances to acquire cine-MRI with a 1-second duration as described in this work. One can tackle this problem using either motion quantities derived from tagged-MRI, such as strain and the mean of the magnitude of the motion field as in Eq. (11), or motion quantities derived from cine-MRI [10, 33]. These quantities can serve as a motion descriptor to find temporal correspondences across subjects. Additionally, if speech acoustic samples that are synchronized with cine-MRI data are available, then one may consider using them to find temporal correspondences across subjects using a dynamic time warping approach [29], and then apply the alignment to the imaging data.
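As an illustration of how a scalar motion descriptor could be used to find temporal correspondences, the sketch below implements textbook dynamic time warping between two per-frame descriptor sequences. The descriptor values are synthetic, the function name `dtw_path` is hypothetical, and this is not the alignment procedure used in this work.

```python
import numpy as np

def dtw_path(a, b):
    """Classic dynamic time warping between two 1D descriptor sequences.

    Returns the total alignment cost and the list of matched
    (frame_in_a, frame_in_b) index pairs.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)      # cumulative cost table
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # Backtrack from the last frame pair to the first.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(D[i - 1, j - 1], i - 1, j - 1),
                      (D[i - 1, j], i - 1, j),
                      (D[i, j - 1], i, j - 1)]
        _, i, j = min(candidates, key=lambda c: c[0])
        path.append((i - 1, j - 1))
    return D[n, m], path[::-1]

# Synthetic descriptors, e.g. mean motion magnitude per time frame.
subject_a = [0.0, 1.0, 2.0, 1.0, 0.0]
subject_b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]   # same gesture, delayed onset
cost, path = dtw_path(subject_a, subject_b)
```

The resulting `path` pairs each frame of one subject with one or more frames of the other; the same pairing could then be applied to the corresponding image volumes or motion fields.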

5.4 Using Lagrangian strain as a marker of muscle activation

Internal tongue motion patterns and associated strain measurements provide a link between muscle activation and tongue surface shape. Although strain measurements indicating tissue compression and expansion can be used as a useful surrogate (or biomarker) for muscle activation, their relationship to actual muscle activation is complex and variable from one subject to another. The challenge arises mainly because a large number of muscles are interdigitated and activate in different patterns to create a deformation, leading to complex strain directions that are difficult to assess quantitatively and visually. Our atlas framework could provide new insights into the understanding of tongue muscle coordination and muscle activation by providing average motion fields and principal directions and magnitudes of the strain. More studies using electromyography (EMG) [25] or biomechanical simulations [26] are, however, needed to investigate this relationship more fully alongside our framework.
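The principal directions and magnitudes of strain mentioned above can be obtained by an eigen-decomposition of the symmetric strain tensor at each point. The sketch below shows this for a single hand-made tensor; the function name and the example values are assumptions for illustration only.

```python
import numpy as np

def principal_strains(E):
    """Principal strain magnitudes and directions of a strain tensor.

    E: array of shape (..., 3, 3). The tensor is symmetrized first to
       guard against small numerical asymmetries.
    Returns eigenvalues in ascending order and the matching unit
    eigenvectors as columns of the last two axes.
    """
    E_sym = 0.5 * (E + np.swapaxes(E, -1, -2))
    values, vectors = np.linalg.eigh(E_sym)
    return values, vectors

# Example tensor: expansion along x, compression along z.
E = np.diag([0.105, 0.0, -0.05])
values, vectors = principal_strains(E)
# values[0] is the most compressive strain, values[-1] the most expansive;
# vectors[:, k] is the direction associated with values[k].
```

Negative eigenvalues indicate compression along the corresponding direction and positive eigenvalues indicate expansion, which is the sense in which strain can act as a surrogate for muscle shortening.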

5.5 Using diffeomorphic registration to create the 4D atlas during speech

Our 4D atlas approach relies on the accurate registration of tongue regions and associated motion fields to a common template to localize motion changes during speech. Registration may work better for normal controls than for patient data, especially in glossectomy patients, depending on the regional homogeneity in the tongue that is used to find the correspondences. Some patients have heterogeneous tissue types (e.g., glossectomies) or smaller tongues depending on different disease states (e.g., ALS). Therefore, it is necessary to perform some specialized preprocessing to reduce registration error and potential bias. For example, one can segment the tongue region and use the corresponding segmented regions to provide initial anatomical landmarks for the registration method that help localize and emphasize the region of interest.

5.6 Interfacing between motion fields and anatomy in the 4D atlas

Measuring internal motion patterns is only the first step towards explaining the intramural mechanics of the human tongue in association with physiological deformations during speech. Next, one needs to put the analysis, including strain measurements, in the context of the tongue’s muscular anatomy derived from diffusion MRI, structural MRI [32], or structural atlases [27] to investigate muscle activation along with muscle anatomy, such as muscle shortening. Multimodal registration such as [31] could be used to interface between motion fields and anatomy depending on the imaging modalities being considered. In this way, our atlas framework could allow us to compare motion information at different resolution levels ranging from voxel level to individual muscle level to muscle group level (e.g., functional units [30]). In addition, linking internal motion fields and associated strain patterns to muscle anatomy in the subject space as well as the atlas space has the potential to shed light on the “functional organization” of the tongue during speech, such as functional units [30].

5.7 Tracking tissue points using phase-based versus intensity-based registration approach

To track each point of the tongue from tagged-MRI, we used PVIRA in this work. PVIRA works on the harmonic phase volumes extracted from tagged-MRI, while iLogDemons works directly on the intensity volumes of the tagged data. Since harmonic phase is a physical property directly related to actual tissue location, direct matching of phase values is more reliable than matching of intensity values, which are prone to tag fading and noise. The original HARP work [21] demonstrated that the use of phase normally results in tracking errors of less than a third of the pixel resolution, while matching of intensity is more likely to fail when tags fade at later time frames. Therefore, using PVIRA instead of iLogDemons is a natural choice to reduce the impact of intensity change over all time frames. In related work, tracking internal points of the tongue in a 2D mid-sagittal slice using HARP from tagged-MRI has shown superior performance to various intensity-based registration approaches from cine-MRI [34].

5.8 Computing statistics on diffeomorphisms to characterize individual subject’s vocal tract parameters

Since the inception of “computational vocal tract anatomy” [27], efforts have been made to understand and model not just the anatomy itself but also the anatomical changes of the tongue over time and its variations across a population [28, 10]. Although this first effort only accounts for the tongue, we can expand the reference anatomic configuration and the motion fields to include the whole vocal tract. In addition, although registration of each subject with the atlas warps the vocal tract in a way that would not be expected to preserve articulatory-acoustic relations, the diffeomorphisms used to warp each individual subject to the atlas encode information on parameters that reveal characteristics of each individual’s vocal tract, such as the vocal tract area function. This is similar to the approaches used in computational anatomy [35].

6 Conclusions and Future Directions

In this work, a statistical multimodal atlas of 4D tongue motion from both cine- and tagged-MRI has been successfully constructed using healthy subjects for the speech task, “-suk.” To our knowledge, this atlas is the first of its kind, thus opening new vistas to study the relationship between structural and functional properties of the tongue during speech. In our future work, we will further investigate classification and statistical techniques for categorizing groups of subjects. For instance, in addition to PCA, deep learning based approaches [24] will be carried out to perform regression and classification using the multidimensional features including displacements, strain, or muscle mechanics derived from multimodal imaging data to differentiate between the normal motion pattern from our atlas and pathologic motion patterns from patient populations. In addition, our approach can be broadly applied to analyze how tongue function for speech is limited by abnormal internal motion and strain in a variety of patient groups such as patients who have undergone treatment for cancer or other diseases such as aphasia or impaired language development caused by brain injury.

7 Acknowledgements

This research was supported in part by NIH R00DC012575, R01DC014717, R01CA133015, S10OD011928, NSF PHY1504804, and ECOR ISF funding. We thank Euna Lee for proofreading the text.


  • [1] Kier, W. M., Smith, K.K., “Tongues, Tentacles and Trunks: the Biomechanics of Movement in Muscular-hydrostats.” Zool. J. Linnean Soc., 83, pp. 307–324, 1985
  • [2] Gilbert, R. J., Magnusson, L. H., Napadow, V. J., Benner, T., Wang, R., and Wedeen, V. J., “Mapping complex myoarchitecture in the bovine tongue with diffusion-spectrum magnetic resonance imaging,” Biophysical journal, 91(3), pp. 1014-1022, 2006
  • [3] Gaige, T.A., Benner, T., Wang, R., Wedeen, V.J. and Gilbert, R.J., “Three dimensional myoarchitecture of the human tongue determined in vivo by diffusion tensor imaging with tractography,” Journal of Magnetic Resonance Imaging, 26(3), pp. 654-661, 2007
  • [4] Shinagawa, H., Murano, E. Z., Zhuo, J., Landman, B., Gullapalli, R. P., Prince, J. L. and Stone, M., “Tongue muscle fiber tracking during rest and tongue protrusion with oral appliances: A preliminary study with diffusion tensor imaging,” Acoustical Science and Technology, 29(4), 291-294, 2008
  • [5] Narayanan, S., Nayak, K., Lee, S., Sethy, A. and Byrd, D., “An approach to real-time magnetic resonance imaging for speech production,” Journal of Acoustical Society of America, 115(4), pp. 1771–1776, 2004
  • [6] Fu, M., Barlaz, M. S., Holtrop, J. L., Perry, J. L., Kuehn, D. P., Shosted, R. K., Liang, Z. P., Sutton, B. P., “High-frame-rate full-vocal-tract 3D dynamic speech imaging,” Magnetic Resonance in Medicine, 77(4), pp. 1619-1629, 2016
  • [7] Parthasarathy, V., Prince, J.L., Stone, M., Murano, E., Nessaiver, M., “Measuring Tongue Motion from Tagged Cine-MRI Using Harmonic Phase (HARP) Processing,” Journal of Acoustical Society of America, 121(1), pp. 491–504, 2007
  • [8] Xing, F., Woo, J., Lee, J., Murano, E. Z., Stone, M. and Prince, J.L., “Analysis of 3-D Tongue Motion From Tagged and Cine Magnetic Resonance Images,” Journal of Speech Language and Hearing Research, 59(3), pp. 468–79, 2016
  • [9] Wedeen, V.J., Weisskoff, R.M., Reese, T.G., Beache, G.M., Poncelet, B.P., Rosen, B.R. and Dinsmore, R.E., “Motionless Movies of Myocardial Strain-Rates using Stimulated Echoes,” Magnetic Resonance in Medicine, 33(3), pp. 401–408, 1995
  • [10] Woo, J., Xing, F., Lee, J., Stone, M. and Prince, J., “A Spatio-Temporal Atlas and Statistical Model of the Tongue During Speech from Cine-MRI,” Journal of Computer Methods in Biomechanics and Biomedical Engineering (CMBBE): Special Issue on Imaging and Visualization, pp. 1–12, 2016
  • [11] Ibragimov, B., Prince, J., Murano, E., Woo, J., Stone, M., Likar, B., Pernus, F., Vrtovec, T., “Segmentation of Tongue Muscles from Super-Resolution MR Images,” Medical Image Analysis, 20(1), pp. 198–207, 2015
  • [12] Xing, F., Prince, J., Stone, M., Wedeen, V., El Fakhri, G., Woo, J., “A Four-dimensional Motion Field Atlas of the Tongue from Tagged and Cine Magnetic Resonance Imaging,” SPIE Medical Imaging, Orlando, Florida, Feb., 2017
  • [13] Puyol-Antón, E., Sinclair, M., Gerber, B., Silvia, M. S., Langet, H., De Craene, M., Aljabar, P., Piro, P., King, A. P., “A multimodal spatiotemporal cardiac motion atlas from MR and ultrasound data,” Medical Image Analysis, 40, pp. 96-110, 2017
  • [14] Woo, J., Normandin, M., Guehl, N., Wooten, D., Brady, T., Baghdady, R., Shoup, T., Ouyang, J., El Fakhri, G. and Alpert, N., “4D Multimodal Atlas of the Swine Heart from PET-CT Images,” Society of Nuclear Medicine, San Diego, CA, 2016
  • [15] Wang, B., Prastawa, M., Irimia, A., Saha, A., Liu, W., Goh, S. Y. M., Vespa, P. M., Van Horn, J., Gerig, G., “Modeling 4D pathological changes by leveraging normative models,” Computer Vision and Image Understanding, 151, pp. 3-13, 2016
  • [16] Woo, J., Xing, F., Lee, J., Stone, M. and Prince, J., “Construction of An Unbiased Spatio-Temporal Atlas of the Tongue During Speech,” Inf Process Med Imaging, 24, pp. 723-32, 2015
  • [17] McVeigh, E. R. and Atalar, E., “Cardiac tagging with breath-hold cine MRI,” Magn Reson Med, 28(2), pp. 318–27, 1992
  • [18] Woo, J., Murano, E.Z., Stone, M. and Prince, J. L., “Reconstruction of high-resolution tongue volumes from MRI,” IEEE Transactions on Biomedical Engineering, 59(12), pp. 3511–3524, 2012
  • [19] Xing, F., Woo, J., Gomez, A. D., Pham, D. L., Bayly, P. V., Stone, M. and Prince, J. L., “Phase Vector Incompressible Registration Algorithm (PVIRA) for Motion Estimation from Tagged Magnetic Resonance Images,” IEEE Trans Med Imaging, 36(10), pp. 2116–2128, 2017
  • [20] Avants, B.B., Tustison, N.J., Song, G., Cook, P.A., Klein, A. and Gee, J.C., “A reproducible evaluation of ANTs similarity metric performance in brain image registration,” Neuroimage, 54(3), pp. 2033–2044, 2011
  • [21] Osman, N.F., Kerwin, W.S., McVeigh, E.R. and Prince, J. L., “Cardiac motion tracking using CINE harmonic phase (HARP) magnetic resonance imaging,” Magnetic Resonance in Medicine, 42(6), pp.1048–1060, 1999
  • [22] Mansi, T., Pennec, X., Sermesant, M., Delingette, H. and Ayache, N., “iLogDemons: A Demons-Based Registration Algorithm for Tracking Incompressible Elastic Biological Tissues,” International Journal of Computer Vision. 92(1), pp. 92–111, 2011
  • [23] Ehrhardt, J., Werner, R., Schmidt-Richberg, A., and Handels, H., “Statistical Modeling of 4D Respiratory Lung Motion Using Diffeomorphic Image Registration,” IEEE Trans. Medical Imaging, 30(2), pp. 251–265, 2011
  • [24] LeCun, Y., Bengio, Y, and Hinton, G., “Deep learning,” Nature, 521, pp. 436–444, 2015
  • [25] Pittman, L. J., Bailey, E. F., “Genioglossus and intrinsic electromyographic activities in impeded and unimpeded protrusion tasks,” Journal of neurophysiology, 101(1), pp. 276–282, 2009
  • [26] Stavness, I., Lloyd, J. E., Fels, S., “Automatic prediction of tongue muscle activations using a finite element model,” Journal of Biomechanics, 45(16), pp. 2841–2848, 2012
  • [27] Woo, J., Lee, J., Murano, E., Xing, F., Al-Talib, M., Stone, M., Prince, J. L., “A High-resolution Atlas and Statistical Model of the Vocal Tract from Structural MRI,” Journal of Computer Methods in Biomechanics and Biomedical Engineering, 3(1), pp. 47–60, 2015
  • [28] Stone, M., Woo, J., Lee, J., Poole, T., Seagraves, A., Chung, M., Kim, E., Murano, E., Prince, J. L., Blemker, S., “Structure and Variability in Tongue Muscle Anatomy,” Journal of Computer Methods in Biomechanics and Biomedical Engineering, 2016
  • [29] Lucero, J. C., Munhall, K. G., Gracco, V. L., Ramsay, J. O., “On the registration of time and the patterning of speech movements,” Journal of Speech, Language, and Hearing Research, 40(5), 1111-1117, 1997
  • [30] Woo, J., Xing, F., Lee, J., Stone, M., Prince, J. L., “Determining functional units of tongue motion via graph-regularized sparse non-negative matrix factorization,” Medical image computing and computer-assisted intervention, 17(2), 2014
  • [31] Woo, J., Stone, M., and Prince, J. L., “Multimodal Registration via Mutual Information Incorporating Geometric and Spatial Context,” IEEE Trans on Image Processing, 24(2), pp. 757-769, February, 2015
  • [32] Xing, F., Ye, C., Woo, J., Stone, M., and Prince, J. L., “Relating speech production to tongue muscle compressions using tagged and high-resolution magnetic resonance imaging,” SPIE Medical Imaging: Image Processing, Florida, USA, February, 2015
  • [33] Fu, M., Woo, J., Liang, Z.-P., and Sutton, B. “Spatiotemporal-atlas-based dynamic speech imaging,” SPIE medical imaging, February, 2016
  • [34] Woo, J., Stone, M., Suo, Y., Murano, E. Z., and Prince, J. L. “Tissue-point motion tracking in the tongue from cine MRI and tagged MRI,” Journal of Speech, Language, and Hearing Research, 57(2), pp. 626-636, 2014
  • [35] Miller, M. I., “Computational anatomy: shape, growth, and atrophy comparison via diffeomorphisms,” Neuroimage, 23, pp. 19-33, 2004