A Deep Joint Sparse Non-negative Matrix Factorization Framework for Identifying the Common and Subject-specific Functional Units of Tongue Motion During Speech

07/09/2020
by Jonghye Woo, et al.

Intelligible speech is produced by creating varying internal local muscle groupings—i.e., functional units—that are generated in a systematic and coordinated manner. There are two major challenges in characterizing and analyzing functional units. First, due to the complex and convoluted nature of tongue structure and function, it is of great importance to develop a method that can accurately decode complex muscle coordination patterns during speech. Second, it is challenging to keep identified functional units across subjects comparable due to their substantial variability. In this work, to address these challenges, we develop a new deep learning framework to identify common and subject-specific functional units of tongue motion during speech. Our framework hinges on joint deep graph-regularized sparse non-negative matrix factorization (NMF) using motion quantities derived from displacements by tagged Magnetic Resonance Imaging. More specifically, we transform NMF with sparse and manifold regularizations into modular architectures akin to deep neural networks by unfolding the Iterative Shrinkage-Thresholding Algorithm to learn interpretable building blocks and their associated weighting maps. We then apply spectral clustering to the common and subject-specific weighting maps, from which the common and subject-specific functional units are derived. Experiments carried out with simulated datasets show that the proposed method surpasses the comparison methods. Experiments carried out with in vivo tongue motion datasets show that the proposed method can determine the common and subject-specific functional units with increased interpretability and decreased size variability.


1 Introduction

Intelligible speech is produced by intricate and successful orchestration of local muscle groupings—i.e., functional units—of the extremely complex muscular architecture of the tongue (Woo et al., 2019a). The tongue is controlled intricately by the support of its myoarchitecture, comprising an array of highly inter-digitated intrinsic and extrinsic muscles (Gaige et al., 2007). As a result, there is great interest in and need for studying the intrinsic dimension-reduced structures of speech movements in order to better understand the mechanisms by which the intrinsic and extrinsic muscles of the tongue coordinate to generate rapid yet accurate speech movements. To date, a great deal of work from different disciplines, including neurophysiology, biomechanics, speech and language, and medical imaging and analysis, has hypothesized and demonstrated that the control of tongue movements is governed by a reduced number of degrees of freedom (Gick and Stavness, 2013) associated with corresponding neuromuscular modules (Bizzi et al., 1991; Kelso, 2009), or fixed or mutable local muscle groupings (Woo et al., 2019a; Stone et al., 2004).

Figure 1: A flowchart of our method. Subject-specific motion tracking results from tagged MRI are first transformed into an atlas space representing a neutral tongue position. Our deep learning framework is then used to determine the common as well as subject-specific functional units.

Medical imaging techniques including magnetic resonance imaging (MRI) have been used to characterize functional units of speech movements (Woo et al., 2019a; Stone et al., 2004). In particular, tagged MRI allows us to non-invasively track spatiotemporally varying speech movements at the voxel level (Parthasarathy et al., 2007; Xing et al., 2017; Osman et al., 1999). More specifically, MR tagging generates temporary grid-like patterns within the tissue via a sequence of radiofrequency pulses that spatially modulate the longitudinal magnetization of hydrogen protons. The induced temporary grid-like tagging patterns deform alongside tongue motion and are visible perpendicular to the tagging planes. 2D plus time or 3D plus time velocity fields at the voxel level are then typically estimated via tracking algorithms based on harmonic phase (HARP) (Parthasarathy et al., 2007; Osman et al., 1999; Xing et al., 2017).

In order to identify the “manageable” number of degrees of freedom of speech movements and better understand the spatial and temporal couplings between different parts of the tongue, various modeling approaches have been developed. Non-negative matrix factorization (NMF) (Lee and Seung, 1999) and its variants, including sparse NMF, are well-recognized, given that NMF is capable of examining signals derived from intrinsic muscle activations, which are non-negative (Ting and Chvatal, 2010). Sparse NMF (Kim and Park, 2008) is a matrix decomposition approach in which an input matrix with non-negative entries is expressed as a sparse linear combination of a set of building blocks. Since the building blocks can be seen as an underlying anatomical basis, the associated weighting map can be used to reveal consistent and coherent sub-motion patterns. To further characterize the underlying physiology of speech movements using NMF, additional prior knowledge, such as the manifold geometry of the input movement data, has also been investigated (Woo et al., 2019a; Cai et al., 2010).

There are two major challenges to be addressed in this work. First, the prior approach (Woo et al., 2019a) to identifying functional units using sparse NMF is based on a shallow NMF model, which may not accurately capture the tongue’s complex underlying physiology. This is because the human tongue is a structurally and functionally complex myoarchitecture (Woo et al., 2015; Stone et al., 2018). In addition, deforming local regions within the tongue requires activating a complex subset of this muscular array, which may be directly or proportionally associated with neuromuscular modules; this mapping, however, remains largely unknown at present and is being studied by many researchers using various approaches. Therefore, there is a need to develop an NMF model that can learn complex muscle coordination patterns from motion features derived from speech movements, while retaining the constraints and advantages of an NMF model, which can deal with non-negative signals and offer parts-based and interpretable representations, respectively. Second, because of the different motions that tongues produce during the course of speech, functional units vary substantially from one subject to another. Thus, an important hurdle in analyzing functional units is keeping the identified functional units comparable across subjects despite this variability. Independently applying an NMF model to determine individual functional units may yield suboptimal building blocks and weighting maps, thereby producing results that are challenging to compare objectively across subjects.

To alleviate the aforementioned challenges, we present a normalization method that can identify both the common and subject-specific functional units in a cohort of speakers in an atlas space from tagged MRI and 3D plus time voxel-level tracking by extending our prior work (Woo et al., 2020). In contrast to the prior work (Woo et al., 2020), we further describe a refined method using a deep joint sparse NMF framework to identify spatiotemporally varying functional units using a simple utterance and carry out extensive validations on both simulated and in vivo tongue motion data. Our deep joint sparse NMF framework computes a set of building blocks and both subject-specific and common weighting maps given motion quantities from tagged MRI. We then apply spectral clustering to the common and subject-specific weighting maps to jointly determine the common functional units across subjects and the subject-specific functional units for each subject.

The contributions of the proposed method can be summarized as follows:

  • The most prominent contribution of this work is to construct an atlas of the functional units—i.e., the common consensus functional units—showing how tongue muscles coordinate to produce observed target motions in a healthy population from cine and tagged MRI.

  • The proposed work can simultaneously yield both common and subject-specific functional units within a material coordinate system with reduced size variability, thereby greatly facilitating the comparison of identified functional units during speech across subjects.

  • The proposed work converts NMF with sparse and manifold regularizations into modular architectures by means of unfolding the Iterative Shrinkage-Thresholding Algorithm (ISTA), thereby accurately capturing sub-motion patterns through each subject’s underlying low-dimensional subspace.

  • The proposed work shows superior clustering performance over the comparison methods on both simulated and in vivo tongue motion datasets, thus demonstrating its potential to advance our understanding of speech motor control and to inform therapeutic, rehabilitative, and surgical procedures.

The rest of this paper is structured as follows. Section 2 reviews related work. Section 3 defines the problem and describes our proposed approach. The experimental results are shown in Section 4, and Section 5 presents a discussion. Finally, we conclude this paper in Section 6.

2 Related Work

2.1 Functional Units

Various attempts have been made to investigate functional units of tongue motion during speech using imaging and motion capture techniques. For example, Green and Wang (2003) studied functionally independent articulators within the tongue based on a correlation analysis of an x-ray microbeam database. In that work, functional independence was assessed by means of movement coupling relations, demonstrating phonemic differentiation in vertical tongue motions from 20 vowel-consonant-vowel (VCV) combinations. Similarly, Stone et al. (2004) examined the functional independence of five segments within the tongue during speech using a correlation analysis of 2D plus time ultrasound and tagged MRI. That work demonstrated that adjacent segments have high correlations, while distant segments have negative correlations, which is consistent with linguistic constraints. Ramanarayanan et al. (2013) proposed a computational framework to identify linguistically interpretable tongue movement primitives from speech articulation data based on a convolutive NMF algorithm with sparseness constraints, using electromagnetic articulography and synthetic data generated via an articulatory synthesizer. Woo et al. (2019a) presented a framework to examine functional units using a shallow graph-regularized sparse NMF model from tagged MRI and 3D plus time voxel-level tracking. Recently, Sorensen et al. (2019) investigated a functional grouping of articulators and its variability across participants from real-time MRI. All of this work, however, investigated subject-specific functional units and therefore lacks an understanding of the common functional units in a healthy population. In this work, we extend our prior approaches (Woo et al., 2019a, 2020) to develop a deep joint sparse NMF framework that can co-identify common and subject-specific functional units across participants.

2.2 Deep NMF

The recent success of deep neural networks has led many researchers to investigate “deep NMF.” For example, a deep unfolding method was developed, yielding a new formulation that can be trained using a multiplicative back-propagation method (Hershey et al., 2014). In addition, deep NMF (Le Roux et al., 2015) was proposed by unfolding the NMF iterations and untying its parameters for the application of audio source separation. Furthermore, a new architecture combining NMF with deep recurrent neural networks (Wisdom et al., 2017) was presented by unfolding the iterations of the Iterative Shrinkage-Thresholding Algorithm (ISTA) (Gregor and LeCun, 2010). In the present work, we aim to develop deep NMF with both sparse and manifold regularizations by unfolding the ISTA iterations. We note that a similar idea has been explored in the prior work described above, but this work further incorporates both sparse and manifold regularizations into the deep NMF framework.

3 Methodology

3.1 Participants and MRI Data Collection

In this work, a total of 18 healthy speakers were included. Table 1 lists the characteristics of the subjects. Each speaker was trained prior to the MR scan to speak a simple utterance (“a souk”) in line with a periodic metronome-like sound. Each speaker then repeated the utterance following the periodic sound while T2-weighted 2D tagged and cine MRI were acquired on a Siemens 3.0 T Tim Trio system (Siemens Medical Solutions, Erlangen, Germany) with a 12-channel head coil and a 4-channel neck coil. Both dynamic MR sequences were acquired at 26 frames per second in three orthogonal orientations: coronal, axial, and sagittal. Then, for cine MRI, a super-resolution volume reconstruction technique (Woo et al., 2012) was used to combine the three orthogonal stacks into a single volume with isotropic resolution.

Subject Age Gender Subject Age Gender
1 23 M 10 26 F
2 31 F 11 22 M
3 27 F 12 43 M
4 41 F 13 27 M
5 35 M 14 42 F
6 45 F 15 59 F
7 27 F 16 52 M
8 22 F 17 54 M
9 22 F 18 27 M
Table 1: Characteristics of 18 healthy subjects

3.2 Estimation of Subject-specific Motion Fields from Tagged MRI

For the 3D plus time motion estimation, we use a tracking method by Xing et al. (2017) that hinges on symmetric and diffeomorphic registration with HARP phase volumes to yield a sequence of voxel-level motion fields from tagged MRI during the course of the speech tasks. In brief, 2D slices are first interpolated to 3D voxel locations using cubic B-splines. Then, a HARP tracking method (Osman et al., 1999) is utilized to yield HARP phase volumes. Finally, the iLogDemons method (Mansi et al., 2011) is applied to find symmetric and diffeomorphic transformations from a reference time frame to each target time frame. The transformations are given by

$$\phi_n^t : \Omega \rightarrow \Omega, \quad n = 1, \ldots, N, \quad t = 1, \ldots, T, \tag{1}$$

where $\phi_n^t$ maps the reference time frame to time frame $t$ of the $n$-th subject, $N = 18$, and $T = 26$ in these phase volumes. Finding symmetric and diffeomorphic transformations with the volume-preserving constraint is crucial for tongue motion analysis. This is because the volume of the tongue remains constant in the course of the transformation, while the smoothness of anatomical details within the tongue needs to be preserved.

3.3 Identification of Subject-specific Functional Units via a Deep Sparse NMF Framework

Assume that the tongue comprises distinct clusters—i.e., functional units—in the course of a given phoneme of interest, each of which exhibits a characteristic motion arising from muscle coordination and interaction. In this work, we opt to use graph-regularized sparse NMF to identify functional units for the following reasons. First, in order to accurately characterize each functional unit, it is necessary to project the high-dimensional and complex 3D plus time voxel-level tracking into a low-dimensional subspace in which each axis corresponds to a particular sub-motion pattern. In addition, it is natural that functional units comprising a subset of intrinsic and extrinsic muscles are not entirely independent of each other; there could be some overlap among them. Furthermore, since functional units are assumed to be the result of an additive mixture of the underlying muscle activations, the linear combination coefficients—i.e., the weighting maps—need to take non-negative values only.

Mathematically, the objective of NMF is to factorize a non-negative matrix $\mathbf{X} \in \mathbb{R}_{+}^{m \times n}$ into a non-negative matrix $\mathbf{W} \in \mathbb{R}_{+}^{m \times r}$, the building blocks, and a non-negative matrix $\mathbf{H} \in \mathbb{R}_{+}^{r \times n}$, the weighting map, that minimize the following objective function:

$$\min_{\mathbf{W} \geq 0,\, \mathbf{H} \geq 0} \left\| \mathbf{X} - \mathbf{W}\mathbf{H} \right\|_F^2, \tag{2}$$

where $\|\cdot\|_F$ represents the matrix Frobenius norm defined as

$$\|\mathbf{A}\|_F^2 = \sum_i \sum_j A_{ij}^2 = \mathrm{tr}\left(\mathbf{A}^\top \mathbf{A}\right). \tag{3}$$

Here, $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. Among other divergence measures (Lee and Seung, 1999; Cichocki et al., 2008; Sra and Dhillon, 2006), we focus on the Frobenius norm to compute the dissimilarity between the non-negative input data matrix $\mathbf{X}$ and its approximation $\mathbf{W}\mathbf{H}$. The objective function of graph-regularized sparse NMF can be defined as:

Figure 2: The block diagram shows the learned ISTA architecture obtained by unfolding the ISTA iterations for sparse NMF (two iterations in this figure).

$$\min_{\mathbf{W} \geq 0,\, \mathbf{H} \geq 0} \left\| \mathbf{X} - \mathbf{W}\mathbf{H} \right\|_F^2 + \lambda \|\mathbf{H}\|_1 + \gamma\, \mathrm{tr}\left(\mathbf{H} \mathbf{L} \mathbf{H}^\top\right), \tag{4}$$

where $\lambda$ and $\gamma$ denote the balancing parameters of the sparsity and manifold regularizations, respectively, and $\mathbf{L}$ represents the graph Laplacian matrix. The graph Laplacian is defined as $\mathbf{L} = \mathbf{D} - \mathbf{S}$, where $\mathbf{S}$ is a heat kernel weight matrix associated with the input matrix $\mathbf{X}$ and the degree matrix $\mathbf{D}$ is a diagonal matrix whose entries are $D_{ii} = \sum_j S_{ij}$. Minimizing the manifold regularization term, $\mathrm{tr}(\mathbf{H}\mathbf{L}\mathbf{H}^\top)$, serves as a smoothing operator. In this work, the building blocks $\mathbf{W}$ are first trained using the work by Cai et al. (2010). The ISTA method is then used to solve Eq. (4) for $\mathbf{H}$ as in Fig. 2:

$$\mathbf{H}^{(k+1)} = h_{\lambda/c}\!\left( \mathbf{H}^{(k)} - \frac{1}{c} \nabla f\!\left(\mathbf{H}^{(k)}\right) \right), \tag{5}$$

$$\nabla f(\mathbf{H}) = 2\mathbf{W}^\top\!\left(\mathbf{W}\mathbf{H} - \mathbf{X}\right) + 2\gamma\, \mathbf{H}\mathbf{L}, \tag{6}$$

where $1/c$, $k$, and $h_{\lambda/c}(\cdot)$ denote the step size, the ISTA iteration index, and the soft thresholding function with a threshold value $\lambda/c$, respectively. $h_{\lambda/c}(\cdot)$ is given by

$$h_{\lambda/c}(x) = \mathrm{sign}(x)\, \max\!\left( |x| - \frac{\lambda}{c},\, 0 \right). \tag{7}$$

Because of the non-negative constraint imposed on $\mathbf{H}$, the soft-thresholding operation can be seen as a rectified linear unit (ReLU) activation function. It is worth noting that this minimization is equivalent to a fully connected layer, followed by ReLU activation, which bears structural similarity with current deep neural network models.
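To make the unfolded computation concrete, below is a minimal NumPy sketch of this ISTA scheme for the weighting map; the variable names, non-negative initialization, and Lipschitz-based step size are illustrative assumptions, not the exact training configuration.

```python
import numpy as np

def ista_weighting_map(X, W, L, lam=0.1, gamma=0.05, n_iter=50):
    """Solve min_{H>=0} ||X - WH||_F^2 + lam*||H||_1 + gamma*tr(H L H^T)
    with ISTA, as in Eqs. (5)-(7). Each iteration is a gradient step on the
    smooth terms followed by soft-thresholding, which under the non-negativity
    constraint reduces to a shifted ReLU -- the layer structure of Fig. 2."""
    H = np.abs(np.random.randn(W.shape[1], X.shape[1])) * 1e-2  # H >= 0 init
    # Step size 1/c from a Lipschitz bound on the gradient of the smooth part.
    c = 2.0 * (np.linalg.norm(W.T @ W, 2) + gamma * np.linalg.norm(L, 2))
    WtW, WtX = W.T @ W, W.T @ X
    for _ in range(n_iter):
        grad = 2.0 * (WtW @ H - WtX) + 2.0 * gamma * (H @ L)   # Eq. (6)
        H = np.maximum(H - grad / c - lam / c, 0.0)            # Eqs. (5), (7)
    return H
```

In the deep (unfolded) version, each loop iteration becomes one network layer whose parameters can be untied and learned.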

3.3.1 Spectral Clustering

Once we obtain the weighting map from the ISTA method, we carry out clustering on the weighting map, the goal of which is to partition the weighting map into disjoint subsets with high intra-cluster similarity and low inter-cluster similarity via the eigen-structure of a data affinity graph. First, we construct an affinity matrix $\mathbf{A}$ from the weighting map, given by

$$A_{ij} = \exp\!\left( -\frac{\left\| \mathbf{h}_i - \mathbf{h}_j \right\|^2}{\sigma^2} \right), \tag{8}$$

where $\mathbf{h}_i$ denotes the $i$-th column vector of the weighting map $\mathbf{H}$ and $\sigma$ represents the scale factor. Then, spectral clustering (Shi and Malik, 2000) is carried out on the affinity matrix, followed by color-coding of each voxel within the tongue for visualization.
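A short sketch of this step is shown below, using scikit-learn's spectral clustering as a stand-in for the normalized-cut formulation of Shi and Malik (2000); the scale factor and cluster count are placeholder values.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_weighting_map(H, n_units=3, sigma=1.0):
    """Build the affinity matrix of Eq. (8) from the columns of the weighting
    map H (one column per voxel) and partition the voxels into units."""
    d2 = np.sum((H[:, :, None] - H[:, None, :]) ** 2, axis=0)  # ||h_i - h_j||^2
    A = np.exp(-d2 / sigma ** 2)
    labels = SpectralClustering(n_clusters=n_units, affinity="precomputed",
                                random_state=0).fit_predict(A)
    return labels  # one unit label per voxel, e.g., for color-coding
```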

3.4 Deep Joint GS-NMF to Co-identify the Common and Subject-specific Functional Units

3.4.1 Construction of an average intensity and motion field atlas for a reference state from cine and tagged MRI

An average intensity and motion field atlas (Woo et al., 2019b)—i.e., a motionless atlas—is built for a reference time frame from cine and tagged MRI. Due to the large variability in speech movements across subjects even for the same speech task, putting all the data into an atlas space is crucial to facilitate the comparison of subjects by standardizing the varying tongue shape, size, and motion field of each subject. Toward this goal, a symmetric diffeomorphic registration using a cross-correlation (CC) similarity metric is used to construct the average intensity atlas (Avants et al., 2011). Let $\psi_n$ denote the diffeomorphic transformation between the volume of the $n$-th subject and the atlas volume. Then all the motion tracking results are mapped to the atlas space using the following transformation:

$$\tilde{\phi}_n^t = \psi_n \circ \phi_n^t \circ \psi_n^{-1}, \tag{9}$$

where $\tilde{\phi}_n^t$ represents the motion field from the reference time frame to the $t$-th time frame of the $n$-th subject transformed into the atlas space. This Lagrangian configuration allows us to anchor the root of the motion fields in the material coordinates, thereby allowing all motion features to be mapped back to the static anatomy (Woo et al., 2019b).
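The composition in Eq. (9) can be sketched on dense displacement fields as follows, assuming the fields are stored as (3, X, Y, Z) arrays in voxel units and that the inverse subject-to-atlas transformation is available as a displacement field (e.g., from an ANTs-style diffeomorphic registration); this is a minimal sketch, not the exact implementation.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def compose_displacements(u_outer, u_inner):
    """Displacement field of (outer o inner):
    u(x) = u_inner(x) + u_outer(x + u_inner(x))."""
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in u_inner.shape[1:]],
                                indexing="ij")).astype(float)
    warped = grid + u_inner  # sample points x + u_inner(x)
    u_outer_warped = np.stack([map_coordinates(c, warped, order=1,
                                               mode="nearest")
                               for c in u_outer])
    return u_inner + u_outer_warped

def motion_to_atlas(u_motion, u_psi, u_psi_inv):
    """Transport a subject motion field into the atlas: psi o phi o psi^{-1}."""
    return compose_displacements(u_psi,
                                 compose_displacements(u_motion, u_psi_inv))
```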

3.4.2 Deep Joint GS-NMF

Once we establish the atlas and put all the data into the atlas space, we form a feature matrix for further identifying functional units (Woo et al., 2019a). Let $\mathbf{X}_n$ denote the input non-negative feature matrix of the $n$-th subject, consisting of the magnitude and angle quantities derived from displacements (Woo et al., 2019a). The objective function of GS-NMF for identifying individual functional units is defined as

$$\mathcal{L}_n = \left\| \mathbf{X}_n - \mathbf{W}_n \mathbf{H}_n \right\|_F^2 + \gamma\, \mathrm{tr}\left( \mathbf{H}_n \mathbf{L}_n \mathbf{H}_n^\top \right) + \lambda \left\| \mathbf{H}_n \right\|_1, \tag{10}$$

where $\mathbf{W}_n$, $\mathbf{H}_n$, and $\mathbf{L}_n$ represent the building blocks, their weighting map, and the graph Laplacian matrix for the $n$-th subject, respectively, and $\gamma$ and $\lambda$ are the weighting parameters for the manifold and sparsity regularizations, respectively. In order to identify the common weighting map, the following loss function, a measure of disparity between each weighting map and the common weighting map, is defined as

$$\mathcal{L}_c = \sum_{n=1}^{N} \left\| \mathbf{H}_n - \mathbf{H}_c \right\|_F^2, \tag{11}$$

where $\mathbf{H}_c$ represents the common weighting map.

The overall objective function to find the building blocks, subject-specific weighting maps, and common weighting map is then defined as

$$\min_{\{\mathbf{W}_n, \mathbf{H}_n\} \geq 0,\, \mathbf{H}_c \geq 0}\; \sum_{n=1}^{N} \mathcal{L}_n + \beta\, \mathcal{L}_c, \tag{12}$$

where $\beta$ represents a weighting parameter between the GS-NMF reconstruction error and the disparity term incorporating the common weighting map.

The objective function is optimized via an iterative and alternating ISTA update scheme as follows: (1) pre-training $\mathbf{W}_n$ and $\mathbf{H}_n$ using the work by Cai et al. (2010), and $\mathbf{H}_c$ by Eq. (15); (2) solving for $\mathbf{H}_n$ by fixing $\mathbf{W}_n$ and $\mathbf{H}_c$; and (3) solving for $\mathbf{H}_c$ by fixing $\mathbf{W}_n$ and $\mathbf{H}_n$:

$$\mathbf{H}_n^{(k+1)} = h_{\lambda/c}\!\left( \mathbf{H}_n^{(k)} - \frac{1}{c} \nabla f\!\left(\mathbf{H}_n^{(k)}\right) \right), \tag{13}$$

$$\nabla f(\mathbf{H}_n) = 2\mathbf{W}_n^\top\!\left( \mathbf{W}_n \mathbf{H}_n - \mathbf{X}_n \right) + 2\gamma\, \mathbf{H}_n \mathbf{L}_n + 2\beta\left( \mathbf{H}_n - \mathbf{H}_c \right), \tag{14}$$

$$\mathbf{H}_c = \frac{1}{N} \sum_{n=1}^{N} \mathbf{H}_n, \tag{15}$$

where $1/c$, $k$, and $h_{\lambda/c}(\cdot)$ denote the step size, the ISTA iteration index, and the soft thresholding function with a threshold value $\lambda/c$ as in Eq. (7), respectively.
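The alternation of Eqs. (13)-(15) can be sketched as follows; the building blocks are assumed pre-trained and held fixed, and the iteration counts and regularization weights are placeholders.

```python
import numpy as np

def joint_gs_nmf(Xs, Ws, Ls, lam=0.1, gamma=0.05, beta=1.0,
                 n_outer=20, n_ista=10):
    """Alternate ISTA updates of the subject-specific maps H_n (Eqs. 13-14),
    each pulled toward the common map Hc, with the closed-form update of Hc
    (Eq. 15), the minimizer of the disparity term for fixed H_n."""
    N = len(Xs)
    Hs = [np.abs(np.random.randn(W.shape[1], X.shape[1])) * 1e-2
          for W, X in zip(Ws, Xs)]
    Hc = sum(Hs) / N
    for _ in range(n_outer):
        for n in range(N):
            W, X, L, H = Ws[n], Xs[n], Ls[n], Hs[n]
            c = 2.0 * (np.linalg.norm(W.T @ W, 2)
                       + gamma * np.linalg.norm(L, 2) + beta)
            for _ in range(n_ista):
                grad = (2.0 * (W.T @ (W @ H - X))
                        + 2.0 * gamma * (H @ L)
                        + 2.0 * beta * (H - Hc))                # Eq. (14)
                H = np.maximum(H - grad / c - lam / c, 0.0)     # Eq. (13)
            Hs[n] = H
        Hc = sum(Hs) / N                                        # Eq. (15)
    return Hs, Hc
```

Because all subjects are resampled into the atlas space, every $\mathbf{H}_n$ shares the same number of columns (voxels), which is what makes the common map $\mathbf{H}_c$ well-defined.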

3.4.3 Spectral Clustering

The final subject-specific and common functional units are then obtained by applying spectral clustering to the weighting map for each subject and the common weighting map as described in Sec. 3.3.1.

4 Experimental Results

In this section, we first validate our approach on 2D plus time and 3D plus time synthetic motion data, for which ground truth is available, since in vivo data lack ground truth. We then use in vivo tongue motion data obtained by tagged MRI to determine the common and subject-specific functional units.

4.1 Experiments Using Synthetic Tongue Motion Data

Our strategy for quantitative evaluation was to use the proposed method and the comparison methods to extract groupings from a synthetic displacement field composed of known areas representative of functional units. We then compared the groupings output by each method against the known distribution. We constructed simulated 2D and 3D displacement fields based on a tongue geometry derived from a previously developed vocal tract atlas (Woo et al., 2015; Stone et al., 2018).

The 2D displacement fields were based on the areas illustrated in Fig. 3 and include Lagrangian displacements of heterogeneous magnitude representative of vertical and horizontal movements and rotations in one deformed configuration. We used two quantitative measurements: accuracy (AC) and mutual information (MI). Table 2 lists numerical comparisons between graph-regularized NMF + spectral clustering (G-NMF-S), graph-regularized sparse NMF + spectral clustering (GS-NMF-S) (Woo et al., 2019a), ISTA for sparse NMF + spectral clustering (ISTA-S-NMF-S), and the proposed approach (ISTA-GS-NMF-S). The results indicated that our approach surpassed the comparison methods. In our experiments, we set the model parameters to (500, 100, 0, 10, 0.07) for ISTA-S-NMF-S and (500, 100, 0.05, 10, 0.07) for our approach, where the third entry is the manifold regularization weight (zero for ISTA-S-NMF-S, which omits the graph term). These parameters were chosen empirically to maximize the clustering performance.

The 3D displacement fields included two temporal sequences of Lagrangian motion across 11 time frames each. The first dataset includes spatially heterogeneous displacement fields as displayed in Fig. 4. The displacements are distributed based on the locations of the verticalis (V), superior longitudinal (SL), and transverse (T) muscles, which were defined using the vocal tract atlas for each time frame. We note that in the first dataset, the V and SL as well as the V and T muscles interdigitate with each other, respectively. Thus, we have a total of four ground truth labels in our quantitative evaluation. In addition, the V and SL muscles were rotated downward and upward, respectively, while the T muscle was translated upward over the 11 time frames (see Fig. 4). The second dataset also has composite Lagrangian displacement fields across 11 time frames as displayed in Fig. 5. We used the composite displacement field of the genioglossus (GG), T, and geniohyoid (GH) muscles, which also were defined using the vocal tract atlas. We note that in the second dataset, the GG and T interdigitate with each other and therefore we again have a total of four ground truth labels in our quantitative evaluation. The GG and T muscles were rotated downward and upward, respectively, while the GH muscle was translated upward over the 11 time frames (see Fig. 5). The clustering outcomes using the different methods are listed in Table 3, demonstrating that our approach outperformed the comparison methods. In our experiments, for the first dataset, we set the model parameters to (890, 55, 0.03, 49, 0.05) for both ISTA-S-NMF-S and our approach. For the second dataset, we set them to (800, 100, 0, 50, 0.03) for ISTA-S-NMF-S and (800, 100, 0.05, 50, 0.03) for our approach. These parameters were chosen empirically to maximize the clustering performance.
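Assuming the standard conventions for these two metrics—accuracy after Hungarian matching of the arbitrary cluster indices, and normalized mutual information—the evaluation against the known labels can be sketched as follows.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy: find the label permutation maximizing agreement
    (Hungarian algorithm), since cluster indices are arbitrary."""
    k = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        count[t, p] += 1
    rows, cols = linear_sum_assignment(-count)  # maximize matched voxels
    return count[rows, cols].sum() / len(y_true)

# For each method, against the known synthetic labels (hypothetical arrays):
# ac = 100 * clustering_accuracy(gt_labels, pred_labels)
# mi = 100 * normalized_mutual_info_score(gt_labels, pred_labels)
```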

Figure 3: Illustration of 2D synthetic tongue motion simulation: (A) 2D displacement field, (B) the ground truth labels, (C) the result using ISTA-S-NMF-S, and (D) the result using our proposed approach. Different colors represent different class labels.
2D Tongue (%) G-NMF-S GS-NMF-S ISTA-S-NMF-S ISTA-GS-NMF-S
AC 85.74 86.15 91.23 97.53
MI 89.31 89.16 94.30 97.84
Table 2: Clustering Performance of 2D Tongue Simulation Data
Figure 4: Illustration of 3D synthetic tongue motion simulation (data 1): (A) 3D displacement field and (B) the ground truth labels.
Figure 5: Illustration of 3D synthetic tongue motion simulation (data 2): (A) 3D displacement field and (B) the ground truth labels.
Data 1 (%) G-NMF-S GS-NMF-S ISTA-S-NMF-S ISTA-GS-NMF-S
AC 98.57 98.58 99.92 99.95
MI 95.70 95.73 99.58 99.72
Data 2 (%) G-NMF-S GS-NMF-S ISTA-S-NMF-S ISTA-GS-NMF-S
AC 99.37 99.36 100 100
MI 98.00 97.99 100 100
Table 3: Clustering Performance of 3D Tongue Motion Simulation Data
Figure 6: Illustration of the common functional units identified using our proposed approach. The top and bottom rows show two and three clusters of the functional units for the transitions of /uh/-/s/, /s/-/u/, and /u/-/k/, respectively.
Figure 7: Illustration of the subject-specific functional units identified using (Woo et al., 2019a). The top and bottom rows show two and three clusters of the functional units for the transitions of /uh/-/s/, /s/-/u/, and /u/-/k/, respectively.
Figure 8: Illustration of the subject-specific functional units identified using our approach. The top and bottom rows show two and three clusters of the functional units for the transitions of /uh/-/s/, /s/-/u/, and /u/-/k/, respectively.
Figure 9: The comparison of the sizes of the two functional units using our approach and the previous approach for the transitions of /uh/-/s/, /s/-/u/, and /u/-/k/.
Figure 10: The comparison of the sizes of the three functional units using our approach and the previous approach for the transitions of /uh/-/s/, /s/-/u/, and /u/-/k/.

4.2 Experiments Using In Vivo Tongue Motion Data

We applied our proposed framework to a cohort of healthy subjects speaking the simple word “a souk” to identify both the common and subject-specific functional units in the atlas space. We first transformed all the motion fields into the atlas space. Second, we extracted the motion quantities, including the magnitude and angle of the motion trajectories, and constructed an input spatiotemporal matrix containing all 18 healthy subjects. Finally, we scaled the matrix, which was then input into our deep joint sparse NMF framework described above. Student’s t-test was used to compare the results from the different approaches, with the level of significance set at 0.05. In all the experiments below, we set the model parameters to 1, 1000, 20, 100, and 10.
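As a reference for the statistical comparison, a sketch of the test is given below; the arrays are simulated draws matching the reported /uh/-/s/ two-unit means and SDs, not the actual subject data, and an unpaired Student's t-test is shown since the paper does not state whether the test was paired.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Hypothetical per-subject sizes (% of tongue volume) of one functional unit.
sizes_prev = 51.8 + 10.6 * rng.standard_normal(18)  # previous approach
sizes_ours = 52.1 + 9.8 * rng.standard_normal(18)   # proposed approach
t_stat, p_value = ttest_ind(sizes_prev, sizes_ours)
print(f"p = {p_value:.2f}; significant at 0.05: {p_value < 0.05}")
```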

Fig. 6 shows two and three clusters of the common functional units for three distinct phoneme transitions from “a souk”: (1) /uh/-/s/, (2) /s/-/u/, and (3) /u/-/k/. For the transition of /uh/ to /s/, the two functional units (top row) show that the tip and base of the tongue are clustered together, representing forward/upward motion, while the posterior tongue is clustered as a separate unit, representing forward motion. The three functional units (bottom row) create clear divisions between the tongue base, tip, and posterior tongue. For the transition of /s/ to /u/, the two functional units (top row) show divisions between the anterior and posterior tongue. The three functional units (bottom row) further form clear divisions between the anterior, base, and posterior tongue. For the transition of /u/ to /k/, the upper tongue is clustered together, since the tongue body is elevated, while the base and the body of the tongue are divided into separate units.

Figs. 7 and 8 illustrate two and three units of the subject-specific functional units (subject 5 in Table 1) using the previous approach (Woo et al., 2019a) and the proposed approach, respectively. The results from the proposed approach in Fig. 8 show clearer divisions between the units, thereby yielding more interpretable results in relation to the common functional units than the previous approach as visually assessed.

Figs. 9 and 10 illustrate the comparisons of the sizes of the identified functional units across subjects. For the transition of /uh/ to /s/, the standard deviations of the sizes of the identified functional units from the previous approach and our approach were 10.6 and 9.8 for the two units (p = 0.89), respectively. For the three units, the standard deviations of the sizes of the identified functional units from the previous approach and our approach were 8.8 and 4.8 for unit 1 (p = 0.72), 6.5 and 4.5 for unit 2 (p = 0.12), and 5.4 and 5.2 for unit 3 (p = 0.53), respectively. The results indicated that our approach yielded reduced variability in the sizes of the functional units and that our approach and the previous approach did not show a statistically significant difference for any of the units.

For the transition of /s/ to /u/, the standard deviations of the sizes of the functional units from the previous approach and our approach were 7.7 and 7.5 for the two units (p < 0.05), respectively (see Table 4). For the three units, the standard deviations of the sizes of the functional units from the previous approach and our approach were 5.6 and 4.2 for unit 1 (p = 0.06), 4.6 and 5.1 for unit 2 (p = 0.91), and 6.7 and 4.8 for unit 3 (p = 0.25), respectively (see Table 5). The results indicated that our approach yielded reduced variability in the sizes of the functional units and that our approach and the previous approach did not show a statistically significant difference except for the two units.

For the transition of /u/ to /k/, the standard deviations of the sizes of the functional units from the previous approach and our approach were 10.2 and 6.3 for the two units (p = 0.65), respectively (see Table 4). For the three units, the standard deviations of the sizes of the functional units from the previous approach and our approach were 6.0 and 4.1 for unit 1 (p = 0.84), 9.3 and 5.6 for unit 2 (p = 0.80), and 8.5 and 6.2 for unit 3 (p = 0.72), respectively (see Table 5). We note that the result in Fig. 7 was identified by the previous approach (Woo et al., 2019a) in the atlas space, while the result in Fig. 8 was co-identified with the common functional units in Fig. 6. The results indicated that our approach yielded reduced variability in the sizes of the functional units and that our approach and the previous approach did not show a statistically significant difference.

Methods /uh/-/s/ /s/-/u/ /u/-/k/
Unit 1 Unit 2 Unit 1 Unit 2 Unit 1 Unit 2
Previous 51.8±10.6 48.2±10.6 55.4±7.7 44.6±7.7 48.2±10.2 51.8±10.2
Ours 52.1±9.8 47.9±9.8 42.9±7.5 57.1±7.5 47.1±6.3 52.9±6.3
Table 4: Statistics of the sizes of functional units (Mean±SD, %)
Methods /uh/-/s/ /s/-/u/ /u/-/k/
Unit 1 Unit 2 Unit 3 Unit 1 Unit 2 Unit 3 Unit 1 Unit 2 Unit 3
Previous 33.9±8.8 34.6±6.5 31.5±5.4 34.2±5.6 33.5±4.6 32.3±6.7 31.9±6.0 32.1±9.3 35.9±8.5
Ours 34.6±4.8 31.9±4.5 32.3±5.2 32.0±4.2 33.7±5.1 34.3±4.8 32.4±4.1 33.0±5.6 34.6±6.2
Table 5: Statistics of the sizes of functional units (Mean±SD, %)

5 Discussion

The quest for identifying intrinsic “dimension-reduced modular structures”—i.e., functional units—has been central to research on speech production, including motor control theories, from different perspectives. Early findings (Öhman, 1967; Mermelstein, 1973) indicated that the tongue is separated into tip and body carrying out “quasi-independent” motions. A more recent study (Stone et al., 2004) suggested that the tongue could be further divided into anterior, dorsal, middle, and posterior regions carrying out “quasi-independent” motions. Additionally, there is a great deal of work investigating factor analytic models, including Principal Component Analysis (PCA) (Slud et al., 2002; Stone et al., 1997, 2014; Xing et al., 2016) and NMF (Ramanarayanan et al., 2013; Woo et al., 2019a), to represent tongue motions as linear combinations of potential basic factors. Our study furthers this underlying framework via a data-driven approach in which regions of the tongue of any size and shape can constitute this modular structure according to the task at hand. This is made possible, in part, by recent technological advancements in MR imaging and analysis and in machine learning that allow us to examine both tongue structure and function at an unprecedented resolution and accuracy.

To mine such a modular structure inherent in speech movements using NMF, Woo et al. (2019a) proposed to incorporate two additional constraints, sparsity and manifold geometry of the motion patterns, to determine a set of optimized and geometrically meaningful structures. This graph-regularized sparse NMF formulation makes it possible to compute a low-dimensional yet interpretable subspace, followed by identifying subject-specific functional units via spectral clustering. More recently, Woo et al. (2020) investigated the use of the same sparse NMF framework in a groupwise setting to co-identify the common and subject-specific functional units, thereby increasing interpretability in the face of the large variability in identified functional units across subjects. In the present work, we further proposed a joint deep graph-regularized sparse NMF and spectral clustering approach to co-identify the common and subject-specific functional units. This, in turn, increased interpretability and decreased size variability in the identified functional units compared with the previous approach (Woo et al., 2020). In addition, the identified subject-specific functional units are jointly obtained alongside the common functional units, thereby greatly facilitating the comparison of each subject with another.

To achieve deep NMF, we converted the standard NMF with sparse and manifold regularizations into modular architectures by unfolding ISTA to learn the building blocks and associated weighting maps. Deep NMF based on unfolded ISTA (Gregor and LeCun, 2010) has been studied previously, but it is worth noting that, to our knowledge, this is the first attempt at incorporating both sparse and manifold regularizations into the ISTA framework. In addition, we further introduced a common low-dimensional subspace that can learn the common weighting map jointly with the subject-specific weighting maps across subjects.

Quantitative evaluation of the proposed work in the context of in vivo tongue motion is a challenging task. The notion of accuracy within our unsupervised learning setting is ill-posed, as exact validation is impossible due to the lack of ground truth, leaving simulation studies and visual assessment informed by a thorough knowledge of tongue structure and function. In the present work, a tongue motion simulator based on a vocal tract atlas (Woo et al., 2015) was used to generate Lagrangian tongue motion. With this simulator alongside the ground truth, we were able to validate our method, showing superior performance over the comparison methods.

There are a few ways to expand on this work. First, the human tongue consists of numerous intrinsic and extrinsic muscles, each of which plays distinct roles in compressing and expanding tissue points. For example, the GG has a muscular architecture in which different parts of the muscle, from GG anterior to GG posterior, are activated locally (Miyawaki, 1975; Stone et al., 2004). As such, identifying such fine-grained local functional units within a single muscle or a subset of muscles in a hierarchical manner would reveal new insights into the mechanisms by which different elements of the muscular architecture interact with each other. In order to accurately localize the internal muscles, structural MRI or diffusion MRI is needed, as they can provide the location of the internal muscles or the fiber architecture, respectively. In addition, our framework can be applied to patient populations, such as those with amyotrophic lateral sclerosis (Xing et al., 2018; Lee et al., 2018) or tongue cancer with speech or swallowing impairments; assessing how local functional units adapt after a variety of treatments can potentially advance therapeutic, rehabilitative, and surgical procedures.

To the best of our knowledge, this is the first report identifying common and subject-specific functional units from cine and tagged MRI. The atlas constructed from cine MRI was used as a reference anatomical configuration for subsequent analyses to identify and visualize the functional units of the internal motion patterns during speech. In this way, it was possible to contrast and compare the identified functional units across subjects that were not biased by each subject’s anatomical characteristics. In addition, the proposed work furthered this underlying concept in which constructing the atlas of functional units was carried out in a low-dimensional subspace, since correspondences across subjects in the low-dimensional subspace were guaranteed through the reference material coordinate system. Therefore, the proposed work holds promise to provide a link between internal tongue motion and underlying low-dimensional subspace, thereby advancing our understanding of the inner workings of the tongue during speech. In addition, the identified common and subject-specific functional units could offer a unique resource in the scientific research community and open new vistas for functional studies of the tongue.

6 Conclusion

In this work, we presented a new method to jointly identify common and subject-specific functional units. To address limitations of shallow NMF and identify comparable and interpretable functional units across subjects, a deep joint NMF framework incorporating sparse and manifold regularizations was proposed. Our proposed method was extensively validated on synthetic and in vivo tongue motion data to demonstrate the benefit of its novel features. Our results show that our method can determine the common and subject-specific functional units with increased interpretability and decreased size variability.

Acknowledgments

This work is partially supported by NIH R01DC014717, R01DC018511, R01CA133015, R21DC016047, R00DC012575, P41EB022544 and NSF 1504804 PoLS.

References

  • Avants et al. (2011) Avants, B.B., Tustison, N.J., Song, G., Cook, P.A., Klein, A., Gee, J.C., 2011. A reproducible evaluation of ANTs similarity metric performance in brain image registration. Neuroimage 54, 2033–2044.
  • Bizzi et al. (1991) Bizzi, E., Mussa-Ivaldi, F.A., Giszter, S., 1991. Computations underlying the execution of movement: a biological perspective. Science 253, 287–291.
  • Cai et al. (2010) Cai, D., He, X., Han, J., Huang, T.S., 2010. Graph regularized nonnegative matrix factorization for data representation. IEEE transactions on pattern analysis and machine intelligence 33, 1548–1560.
  • Cichocki et al. (2008) Cichocki, A., Lee, H., Kim, Y.D., Choi, S., 2008. Non-negative matrix factorization with α-divergence. Pattern Recognition Letters 29, 1433–1440.
  • Gaige et al. (2007) Gaige, T.A., Benner, T., Wang, R., Wedeen, V.J., Gilbert, R.J., 2007. Three dimensional myoarchitecture of the human tongue determined in vivo by diffusion tensor imaging with tractography. Journal of Magnetic Resonance Imaging: An Official Journal of the International Society for Magnetic Resonance in Medicine 26, 654–661.
  • Gick and Stavness (2013) Gick, B., Stavness, I., 2013. Modularizing speech. Frontiers in psychology 4, 977.
  • Green and Wang (2003) Green, J.R., Wang, Y.T., 2003. Tongue-surface movement patterns during speech and swallowing. The Journal of the Acoustical Society of America 113, 2820–2833.
  • Gregor and LeCun (2010) Gregor, K., LeCun, Y., 2010. Learning fast approximations of sparse coding, in: Proceedings of the 27th International Conference on International Conference on Machine Learning, pp. 399–406.
  • Hershey et al. (2014) Hershey, J.R., Roux, J.L., Weninger, F., 2014. Deep unfolding: Model-based inspiration of novel deep architectures. arXiv preprint arXiv:1409.2574 .
  • Kelso (2009) Kelso, J.S., 2009. Synergies: atoms of brain and behavior, in: Progress in motor control. Springer, pp. 83–91.
  • Kim and Park (2008) Kim, J., Park, H., 2008. Sparse nonnegative matrix factorization for clustering. Technical Report. Georgia Institute of Technology.
  • Le Roux et al. (2015) Le Roux, J., Hershey, J.R., Weninger, F., 2015. Deep NMF for speech separation, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. 66–70.
  • Lee and Seung (1999) Lee, D.D., Seung, H.S., 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791.
  • Lee et al. (2018) Lee, E., Xing, F., Ahn, S., Reese, T.G., Wang, R., Green, J.R., Atassi, N., Wedeen, V.J., El Fakhri, G., Woo, J., 2018. Magnetic resonance imaging based anatomical assessment of tongue impairment due to amyotrophic lateral sclerosis: A preliminary study. The Journal of the Acoustical Society of America 143, EL248–EL254.
  • Mansi et al. (2011) Mansi, T., Pennec, X., Sermesant, M., Delingette, H., Ayache, N., 2011. iLogDemons: A demons-based registration algorithm for tracking incompressible elastic biological tissues. International Journal of Computer Vision 92, 92–111.

  • Mermelstein (1973) Mermelstein, P., 1973. Articulatory model for the study of speech production. The Journal of the Acoustical Society of America 53, 1070–1082.
  • Miyawaki (1975) Miyawaki, K., 1975. A preliminary report on the electromyographic study of the activity of lingual muscles. Annual Bulletin of the Research Institute of Logopedics and Phoniatrics, University of Tokyo 9, 91–106.
  • Öhman (1967) Öhman, S.E., 1967. Numerical model of coarticulation. The Journal of the Acoustical Society of America 41, 310–320.
  • Osman et al. (1999) Osman, N.F., Kerwin, W.S., McVeigh, E.R., Prince, J.L., 1999. Cardiac motion tracking using cine harmonic phase (HARP) magnetic resonance imaging. Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine 42, 1048–1060.
  • Parthasarathy et al. (2007) Parthasarathy, V., Prince, J.L., Stone, M., Murano, E.Z., NessAiver, M., 2007. Measuring tongue motion from tagged cine-MRI using harmonic phase (HARP) processing. The Journal of the Acoustical Society of America 121, 491–504.
  • Ramanarayanan et al. (2013) Ramanarayanan, V., Goldstein, L., Narayanan, S.S., 2013. Spatio-temporal articulatory movement primitives during speech production: Extraction, interpretation, and validation. The Journal of the Acoustical Society of America 134, 1378–1394.
  • Shi and Malik (2000) Shi, J., Malik, J., 2000. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence 22, 888–905.
  • Slud et al. (2002) Slud, E., Stone, M., Smith, P.J., Goldstein Jr, M., 2002. Principal components representation of the two-dimensional coronal tongue surface. Phonetica 59, 108–133.
  • Sorensen et al. (2019) Sorensen, T., Toutios, A., Goldstein, L., Narayanan, S., 2019. Task-dependence of articulator synergies. The Journal of the Acoustical Society of America 145, 1504–1520.
  • Sra and Dhillon (2006) Sra, S., Dhillon, I.S., 2006. Generalized nonnegative matrix approximations with Bregman divergences, in: Advances in neural information processing systems, pp. 283–290.
  • Stone et al. (2004) Stone, M., Epstein, M.A., Iskarous, K., 2004. Functional segments in tongue movement. Clinical linguistics & phonetics 18, 507–521.
  • Stone et al. (1997) Stone, M., Goldstein Jr, M.H., Zhang, Y., 1997. Principal component analysis of cross sections of tongue shapes in vowel production. Speech Communication 22, 173–184.
  • Stone et al. (2014) Stone, M., Langguth, J.M., Woo, J., Chen, H., Prince, J.L., 2014. Tongue motion patterns in post-glossectomy and typical speakers: A principal components analysis. Journal of Speech, Language, and Hearing Research .
  • Stone et al. (2018) Stone, M., Woo, J., Lee, J., Poole, T., Seagraves, A., Chung, M., Kim, E., Murano, E.Z., Prince, J.L., Blemker, S.S., 2018. Structure and variability in human tongue muscle anatomy. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 6, 499–507.
  • Ting and Chvatal (2010) Ting, L.H., Chvatal, S.A., 2010. Decomposing muscle activity in motor tasks. Motor Control: Theories, Experiments, and Applications , 102–138.
  • Wisdom et al. (2017) Wisdom, S., Powers, T., Pitton, J., Atlas, L., 2017. Deep recurrent NMF for speech separation by unfolding iterative thresholding, in: 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), IEEE. pp. 254–258.
  • Woo et al. (2015) Woo, J., Lee, J., Murano, E.Z., Xing, F., Al-Talib, M., Stone, M., Prince, J.L., 2015. A high-resolution atlas and statistical model of the vocal tract from structural MRI. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 3, 47–60.
  • Woo et al. (2012) Woo, J., Murano, E.Z., Stone, M., Prince, J.L., 2012. Reconstruction of high-resolution tongue volumes from MRI. IEEE Transactions on Biomedical Engineering 59, 3511–3524.
  • Woo et al. (2019a) Woo, J., Prince, J.L., Stone, M., Xing, F., Gomez, A.D., Green, J.R., Hartnick, C.J., Brady, T.J., Reese, T.G., Wedeen, V.J., et al., 2019a. A sparse non-negative matrix factorization framework for identifying functional units of tongue behavior from MRI. IEEE transactions on medical imaging 38, 730–740.
  • Woo et al. (2020) Woo, J., Xing, F., Prince, J.L., Stone, M., Reese, T.G., Wedeen, V.J., El Fakhri, G., 2020. Identifying the common and subject-specific functional units of speech movements via a joint sparse non-negative matrix factorization framework, in: SPIE Medical Imaging 2020: Image Processing, International Society for Optics and Photonics. p. 113131S.
  • Woo et al. (2019b) Woo, J., Xing, F., Stone, M., Green, J., Reese, T.G., Brady, T.J., Wedeen, V.J., Prince, J.L., El Fakhri, G., 2019b. Speech MAP: A statistical multimodal atlas of 4D tongue motion during speech from tagged and cine MR images. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 7, 361–373.
  • Xing et al. (2018) Xing, F., Prince, J.L., Stone, M., Reese, T.G., Atassi, N., Wedeen, V.J., El Fakhri, G., Woo, J., 2018. Strain map of the tongue in normal and als speech patterns from tagged and diffusion MRI, in: Medical Imaging 2018: Image Processing, International Society for Optics and Photonics. p. 1057411.
  • Xing et al. (2017) Xing, F., Woo, J., Gomez, A.D., Pham, D.L., Bayly, P.V., Stone, M., Prince, J.L., 2017. Phase vector incompressible registration algorithm for motion estimation from tagged magnetic resonance images. IEEE transactions on medical imaging 36, 2116–2128.
  • Xing et al. (2016) Xing, F., Woo, J., Lee, J., Murano, E.Z., Stone, M., Prince, J.L., 2016. Analysis of 3-D tongue motion from tagged and cine magnetic resonance images. Journal of Speech, Language, and Hearing Research 59, 468–479.