I Introduction
Finding a suitable representation of high-dimensional and complex data for various tasks, such as clustering and topic modeling, is a fundamental challenge in many areas, including computer vision, machine learning, data mining, and medical image analysis. Nonnegative matrix factorization (NMF)
[1], an unsupervised generative model, is a class of techniques for capturing a low-dimensional representation of a dataset suitable for a clustering interpretation [2]; it has also been used to transform different features into a common class. In this work, we are interested in modeling the tongue's underlying behaviors using NMF, since the nonnegative properties of NMF are akin to the physiology of the tongue as reflected in the matrix decomposition process. Since its introduction by Lee and Seung [1], special attention has been given to NMF and its variants involving sparsity because of their ability to uncover a parts-based and interpretable representation. Specifically, NMF with a sparsity constraint operates on input matrices whose entries are nonnegative, allowing us to model a data matrix as sparse linear combinations of a set of basis vectors (or building blocks); only nonnegative combinations of the building blocks are permitted. This is analogous to the analysis of muscle synergies
[4, 9], because a complex set of nonnegative activations of tongue muscles generates the complexity and precision of the tongue's voluntary and involuntary movements. The human tongue is one of the most structurally and functionally complex muscular structures in the body, comprising orthogonally oriented and highly interdigitated muscles [5]. It is innervated by over 13,000 hypoglossal motoneurons [3, 6]. A complex set of neural activations of tongue muscles allows localized regions of the tongue to deform within this complex muscular array. Muscle coordination patterns of the tongue are synergies produced by deforming local muscle groups, i.e., functional units [3, 9]. Functional units can be thought of as intermediate structures that link muscle activity to surface tongue geometry and that exhibit cohesive and consistent motion in the course of specific tasks. Therefore, determining functional units and understanding the mechanisms by which muscles coordinate to generate target movements can provide new insights into motor control strategies in normal, pathological, and adapted tongue movements after surgery or treatment. This will, in turn, lead to improvements in surgical planning, treatment, and rehabilitation procedures. To date, however, the mechanisms of these muscle coordination patterns and the relationship between tongue structure and function have remained poorly understood because of the complexity and variability of both the muscular structures and their interactions.
Understanding the subject- and task-specific functional organization of the tongue requires a map of its functional units during oromotor behaviors obtained using medical imaging. In particular, recent advances in medical imaging and associated data analysis techniques have permitted noninvasive imaging and visualization of tongue structure and function in a variety of ways. Magnetic resonance imaging (MRI) technologies have been widely used, demonstrating that a large number of tongue muscles undergo highly complex deformations during speech and other lingual behaviors. For example, noninvasive MRI has allowed researchers to capture both surface motion of the tongue via cine or real-time MRI [10, 11, 12] and internal tissue point motion of the tongue via tagged MRI (tMRI) [7]. In addition, high-resolution MRI and diffusion MRI [27, 26] have allowed imaging of the muscular and fiber anatomy of the tongue, respectively. With advances in computational anatomy, a vocal tract atlas [13, 14], a representation of the tongue anatomy, has also been created and validated; it allows investigation of the relationship between tongue structure and function by providing a reference anatomical configuration in which to analyze similarities and variability of tongue motion.
This work is aimed at developing a computational approach for defining subject- and task-specific, data-driven functional units from tMRI and 4D (3D space plus time) voxel-level tracking [15] by extending our previous approach [19]. We describe a refined algorithm that includes an advanced tracking method and graph-regularized sparse NMF (GSNMF) to determine spatiotemporally varying functional units using protrusion and simple speech tasks, and we offer extensive validations on simulated tongue motion data. Since the standard NMF formulation hinges on a Euclidean distance measure (i.e., the Frobenius norm) or the Kullback-Leibler divergence, it fails to capture the intrinsic manifold geometry of the data
[25]. Our method therefore makes use of both sparse and manifold regularizations to identify a set of optimized and geometrically meaningful motion patterns by promoting the computation of distances on a manifold derived from the observed tongue motion data. Our formulation assumes that the tongue motion features live on a manifold within a sparse NMF framework, thereby finding a low-dimensional yet interpretable subspace of the motion features from tMRI. We present an approach to discovering the functional units using a matrix factorization and probabilistic graphical model framework with the following main contributions:

The most prominent contribution is the use of voxel-level, data-driven tMRI (1.875 mm × 1.875 mm × 1.875 mm) methods incorporating internal tissue points to obtain subject-specific functional units showing how tongue muscles coordinate to generate the observed target motions.

Our tMRI imaging and 4D voxel-level tracking paradigm provides previously unavailable internal tongue motion patterns. The proposed approach is scalable to a variety of motion features derived from motion trajectories to characterize coherent motion patterns. In this work, we consider the most representative features from tMRI, including displacements and angles.

This work applies graph-regularized sparse NMF and a probabilistic graphical model to the voxel-level motion data, allowing us to estimate latent functional coherence by simultaneously learning hidden building blocks and their associated weighting map from a set of motion features.

This work aims at identifying 3D cohesive functional units that involve multiple, possibly undocumented and localized tongue regions, unlike previous approaches that largely rely on tracking landmark points sparsely located on the fixed tongue surface, such as the tip, blade, body, and dorsum.
The remainder of this paper is structured as follows. We review related work in Section II. Section III then describes the proposed method for identifying functional units. The experimental results are provided and analyzed in Section IV. Section V presents a discussion, and Section VI concludes this paper.
II Related Work
In this section, we review recent work on NMF for clustering and on functional units research that is closely related to ours.
NMF for Clustering. A multitude of NMF-based methods for unsupervised data clustering have been proposed over the last decade across various domains, ranging from video analysis to medical image analysis. In particular, the idea of using sparsity-promoting norm regularization (i.e., sparse NMF) for clustering has been successfully employed [30]; the sparsity imposed on the weighting matrix (or coefficient matrix) indicates cluster membership. For example, Shahnaz et al. [31] proposed an algorithm for document clustering based on NMF, in which a low-rank approximation of a sparse matrix is obtained while preserving natural data properties. Wang et al. [33] applied the NMF framework to gene-expression data to identify different cancer classes. Anderson et al. [34] presented an NMF-based clustering method to group differential changes in default mode subnetworks from multimodal data. In addition, Mo et al. [35] proposed an NMF-based motion segmentation method. A key insight in using NMF for clustering is that NMF can learn and discriminate localized traits of data with better interpretability. Please refer to [21] for a detailed review of NMF for clustering.
Speech Production. Research on determining functional units during speech has a long history (see, e.g., Gick and Stavness [16] for a recent perspective). Determining functional units can be seen as uncovering a "missing link" between speech movement primitives and the cortical regions associated with speech production [16]. In the context of lingual coarticulation, functional units can be seen as "quasi-independent" motions [3]. A variety of techniques have been used to tackle this problem. For instance, Stone et al. [17] showed, using ultrasound and microbeam data, that four midsagittal regions (anterior, dorsal, middle, and posterior) functioned quasi-independently. Green and Wang [8] attempted to characterize functionally independent articulators from microbeam data using covariance-based analysis. More recently, Stone et al. [3] proposed an approach to determining functional segments using both ultrasound and tMRI; that work examined compression and expansion between the anterior and posterior parts of the tongue and found that regions or muscles have functional segments influenced by phonemic constraints. Ramanarayanan et al. [18] used a convolutive NMF method to identify motion primitives of the tongue from electromagnetic articulography (EMA). Inspired by the aforementioned approaches, our work uses an augmented NMF framework in conjunction with the far richer 4D tMRI-based voxel-level tracking to determine the functional units of tongue behaviors.
III Methodology
III-A Problem Statement
Without loss of generality, let us first define the notation for the tissue points and derived features used throughout this work. Consider a set of internal points of the tongue tracked through tMRI, where the tracked points are used to derive motion features. We define motion features at each voxel location as scalar quantities, such as the magnitude and angle of each point track, precisely tracked through the time frames; each tissue point is characterized by these scalar quantities, which are then used to cluster the points into coherent motion patterns. We denote the position of the $i$-th tissue point at the $n$-th time frame by $\mathbf{p}_n^i = (x_n^i, y_n^i, z_n^i)$. The tongue motion can then be expressed as a spatiotemporal matrix $\mathbf{X}$, whose $i$-th column is given by
$$\mathbf{x}_i = \left[\mathbf{p}_1^i, \mathbf{p}_2^i, \ldots, \mathbf{p}_T^i\right]^\top, \quad (1)$$
where $T$ is the number of time frames.
The problem of identifying the functional units is viewed as a clustering problem, since the functional units are localized tongue regions that exhibit characteristic and homogeneous motions. Unlike generic motion clustering problems in computer vision or machine learning, however, the tongue's function and physiology also need to be reflected and captured in our formulation. Thus, we opt to identify a permutation of the columns to build $[\mathbf{X}_1, \ldots, \mathbf{X}_k]$, where the submatrix $\mathbf{X}_j$ comprises the tracks belonging to the $j$-th submotion, corresponding to the $j$-th functional unit. The motion quantities for each underlying muscle of the tongue are not completely independent. That is, our framework is data-driven and motion-specific at the sub-muscle level, and a unit may be mapped from a subset of motion quantities from one or more muscles. These latent morphologies (i.e., functional units) inherent in higher-dimensional datasets offer a sparse representation of the generative process behind the tongue's physiology for a sub-muscle, a single muscle, or multiple muscles. To achieve this goal, the sparse NMF method is utilized to capture the latent morphologies from motion trajectories. The proposed method is described in more detail below; a flowchart is depicted in Figure 1.
III-B MR Data Acquisition and Motion Tracking
III-B1 MR Data Acquisition
All MRI data were acquired on a Siemens 3.0 T Tim Trio system with a 12-channel head coil and a 4-channel neck coil. While a participant performed the protrusion task or simple speech tasks, including "a geese" and "a souk," repeatedly in synchrony with a metronome, the tMRI data were acquired using Magnitude Imaged CSPAMM Reconstructed images [28]. All tMRI data have 26 time frames with a temporal resolution of 36 ms with no delay during a 1 s duration. The resolution is 1.875 mm × 1.875 mm × 6 mm.
III-B2 Voxel-Level Motion Tracking from Tagged MRI
The phase vector incompressible registration algorithm (PVIRA) [37] is used to estimate the deformation of the tongue from tMRI, yielding a sequence of voxel-level motion fields. The steps of the algorithm are as follows. First, although the input is a set of sparsely acquired tMRI slices, PVIRA uses cubic B-splines to interpolate these 2D slices onto denser 3D voxel locations. Then a harmonic phase (HARP) [22] filter is applied to produce HARP phase volumes from the interpolated result. Finally, PVIRA uses the iLogDemons method [23] on these phase volumes. Specifically, we denote the phase volumes as $\Theta_a$, $\Theta_s$, $\Theta_c$, $\Phi_a$, $\Phi_s$, and $\Phi_c$, where the subscripts $a$, $s$, and $c$ denote motion information from the three cardinal directions usually contained in the orthogonal axial, sagittal, and coronal tagged slices; $\Phi$ denotes a volume in the reference time frame and $\Theta$ a volume in the deformed time frame. The motion update vector field is derived from these phase volumes. At each voxel, the update vector is obtained by
(2)
where $K$ is a normalization factor determined by the mean squared value of the image voxel size [41], and the remaining terms are defined by
(3)
$\mathbf{v}$ is a stationary motion field that serves as an intermediate step toward the final motion field $\boldsymbol{\phi}$. The wrapping of phase is defined by
$$W(\theta) = \operatorname{mod}(\theta + \pi, 2\pi) - \pi, \quad (4)$$
and the “starred” gradient is defined by
(5) 
After all iterations are complete, the forward and inverse motion fields can be found by
$$\boldsymbol{\phi} = \exp(\mathbf{v}), \qquad \boldsymbol{\phi}^{-1} = \exp(-\mathbf{v}), \quad (6)$$
and they are both incompressible and diffeomorphic, making both Eulerian and Lagrangian computations available for the subsequent continuum mechanics operations. Once the motion field is estimated, the new voxel locations are obtained by applying the motion field to the time frame under consideration.
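As a rough illustration of the final steps above, the sketch below implements the standard phase-wrapping operation and a scaling-and-squaring exponential that produces forward and inverse displacement fields from a stationary velocity field. The tiny 2D grid, nearest-neighbor field composition, and constant test field are simplifications for illustration only, not PVIRA's actual implementation:

```python
import numpy as np

def wrap(theta):
    """Wrap a phase value into [-pi, pi)."""
    return np.mod(theta + np.pi, 2.0 * np.pi) - np.pi

def compose(phi, psi):
    """Approximate composition of displacement fields on a grid:
    (phi o psi)(x) ~ psi(x) + phi(x + psi(x)), nearest-neighbor sampling."""
    H, W, _ = phi.shape
    out = np.empty_like(phi)
    for i in range(H):
        for j in range(W):
            ti = int(round(min(max(i + psi[i, j, 0], 0), H - 1)))
            tj = int(round(min(max(j + psi[i, j, 1], 0), W - 1)))
            out[i, j] = psi[i, j] + phi[ti, tj]
    return out

def exp_field(v, n=6):
    """Exponential of a stationary velocity field by scaling and squaring:
    halve v n times, then compose the result with itself n times."""
    phi = v / (2.0 ** n)
    for _ in range(n):
        phi = compose(phi, phi)
    return phi

# Toy example: a constant velocity field exponentiates to itself,
# and exp(-v) gives the inverse displacement.
v = np.zeros((8, 8, 2))
v[..., 0] = 0.2
fwd, inv = exp_field(v), exp_field(-v)
```

In production code the nearest-neighbor lookup in `compose` would be replaced by (at least) trilinear interpolation, but the scaling-and-squaring structure is the same.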
III-C Extraction of Motion Features
We first extract motion features from the PVIRA results to characterize coherent motion patterns. Similar to [36], we use the magnitude and angle of each track as our motion features. Denoting the displacement of the $i$-th point between frames $n$ and $n+1$ by $\mathbf{d}_n^i = \mathbf{p}_{n+1}^i - \mathbf{p}_n^i$, the magnitude of each track is
$$m^i = \sum_{n=1}^{T-1} \left\| \mathbf{d}_n^i \right\|_2, \quad (7)$$
and the angle features are
$$c_x^i = \cos\theta_x^i + 1, \quad (8)$$
$$c_y^i = \cos\theta_y^i + 1, \quad (9)$$
$$c_z^i = \cos\theta_z^i + 1, \quad (10)$$
where $m^i$ is the magnitude of each track, and $c_x^i$, $c_y^i$, and $c_z^i$ denote the cosine of the angle between two consecutive point displacements projected with respect to the $x$, $y$, and $z$ axes, respectively, plus one (so that the values are nonnegative). We rescale all features into the range 0 to 10 so that the features are comparable and the nonnegativity constraint of the NMF formulation is satisfied.
For further clustering, all the motion features are combined into a single nonnegative matrix $\mathbf{U}$, whose $i$-th column is given by
$$\mathbf{u}_i = \left[m^i, c_x^i, c_y^i, c_z^i\right]^\top. \quad (11)$$
Since each feature value is nonnegative, the matrix can be input directly to our framework.
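The feature-extraction step can be sketched as follows. This is a NumPy sketch under our own assumptions: the exact definition of the per-axis projection used for the angle features is not fully specified above, so here each axis is projected out in turn, and the [0, 10] rescaling is applied per feature row:

```python
import numpy as np

def motion_features(tracks):
    """tracks: (T, N, 3) array of tissue-point positions over T time frames.
    Returns a (4, N) nonnegative feature matrix per point: cumulative
    displacement magnitude plus three 'cosine of angle + 1' features,
    each feature rescaled to [0, 10]."""
    d = np.diff(tracks, axis=0)                       # (T-1, N, 3) displacements
    mag = np.linalg.norm(d, axis=2).sum(axis=0)       # (N,) track magnitude
    feats = [mag]
    for ax in range(3):
        proj = np.delete(d, ax, axis=2)               # drop one axis (assumed projection)
        a, b = proj[:-1], proj[1:]                    # consecutive displacement pairs
        cos = (a * b).sum(-1) / (np.linalg.norm(a, axis=-1)
                                 * np.linalg.norm(b, axis=-1) + 1e-12)
        feats.append((cos + 1.0).mean(axis=0))        # in [0, 2], nonnegative
    U = np.vstack(feats)
    # rescale each feature row to [0, 10] for comparability and nonnegativity
    lo = U.min(axis=1, keepdims=True)
    rng = U.max(axis=1, keepdims=True) - lo
    return 10.0 * (U - lo) / np.where(rng > 0, rng, 1.0)
```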
III-D Graph-Regularized Sparse NMF
III-D1 NMF
Using the nonnegative feature matrix $\mathbf{U} \in \mathbb{R}_{\geq 0}^{m \times n}$ built from the motion quantities described above and a target number of building blocks $k$, let $\mathbf{H} \in \mathbb{R}_{\geq 0}^{m \times k}$ be the building blocks and $\mathbf{W} \in \mathbb{R}_{\geq 0}^{k \times n}$ be the weighting map, respectively. The objective of NMF is to produce two nonnegative output matrices such that $\mathbf{U} \approx \mathbf{H}\mathbf{W}$: (1) spatial building blocks, which can change over time, and (2) a weighting map of those blocks, which weights them over time. We define NMF using the Frobenius norm to minimize the reconstruction error between $\mathbf{U}$ and $\mathbf{H}\mathbf{W}$ [1]:
$$\min_{\mathbf{H} \geq 0,\, \mathbf{W} \geq 0} \left\| \mathbf{U} - \mathbf{H}\mathbf{W} \right\|_F^2, \quad (12)$$
where $\| \cdot \|_F$ is the matrix Frobenius norm. A popular way to solve this is the multiplicative update rule [1]:
$$\mathbf{H} \leftarrow \mathbf{H} \odot \frac{\mathbf{U}\mathbf{W}^\top}{\mathbf{H}\mathbf{W}\mathbf{W}^\top}, \quad (13)$$
$$\mathbf{W} \leftarrow \mathbf{W} \odot \frac{\mathbf{H}^\top\mathbf{U}}{\mathbf{H}^\top\mathbf{H}\mathbf{W}}, \quad (14)$$
where the operator $\odot$ denotes element-wise multiplication and the division is element-wise as well.
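As a concrete illustration, the multiplicative updates of Eqs. (13) and (14) can be written in a few lines of NumPy. This is a minimal sketch, not the implementation used in this work; the small epsilon guarding against division by zero is our addition:

```python
import numpy as np

def nmf(U, k, iters=200, seed=0):
    """Plain NMF by multiplicative updates: U ~ H @ W, with H the building
    blocks and W the weighting map, both kept nonnegative throughout."""
    rng = np.random.default_rng(seed)
    m, n = U.shape
    H = rng.random((m, k)) + 0.1
    W = rng.random((k, n)) + 0.1
    eps = 1e-12                       # avoid division by zero
    for _ in range(iters):
        H *= (U @ W.T) / (H @ W @ W.T + eps)
        W *= (H.T @ U) / (H.T @ H @ W + eps)
    return H, W
```

Because the updates only multiply by nonnegative ratios, nonnegativity of `H` and `W` is preserved automatically, which is the property that makes the factors interpretable as additive parts.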
III-D2 Sparsity Constraint
Once we obtain the building blocks and their weighting map, the identified weighting map is used to identify the cohesive regions, which in turn exhibit functional units. Since many different weighting maps could generate the same movements, a sparsity constraint is imposed on the weighting map so that the obtained functional units represent the simplest, most optimized tongue behaviors that generate the target movements; this is also consistent with current phonological theories [20]. In essence, we use the sparsity constraint to encode the high-dimensional and complex tongue motion data using a small set of active (i.e., nonzero) components, so that the obtained weighting map is simple and easy to interpret. Within the sparse NMF framework, a fractional regularizer using the $L_{1/2}$ norm has been shown to provide superior performance to $L_1$ norm regularization by producing sparser solutions [40]. Thus, in this work, we use the $L_{1/2}$ sparsity constraint in our NMF framework:
$$\min_{\mathbf{H} \geq 0,\, \mathbf{W} \geq 0} \left\| \mathbf{U} - \mathbf{H}\mathbf{W} \right\|_F^2 + \lambda \left\| \mathbf{W} \right\|_{1/2}. \quad (15)$$
Here, the parameter $\lambda$ is a weight associated with the sparseness of $\mathbf{W}$, and the $L_{1/2}$ term is expressed as
$$\left\| \mathbf{W} \right\|_{1/2} = \sum_{j=1}^{k} \sum_{l=1}^{n} W_{jl}^{1/2}. \quad (16)$$
III-D3 Manifold Regularization
Despite the high-dimensional configuration space of human motions, many motions lie on low-dimensional, non-Euclidean manifolds [39]. NMF with the sparsity constraint described above, however, produces a weighting map using a Euclidean structure in the high-dimensional space, so the intrinsic geometric relations between motion quantities cannot be captured faithfully by the sparse NMF formulation. Thus, we add a manifold regularization to capture the intrinsic geometric structure, similar to [25, 42, 43]; the manifold regularization additionally preserves the local geometric structure. Our final cost function using both regularizations is then given by
$$\min_{\mathbf{H} \geq 0,\, \mathbf{W} \geq 0} \left\| \mathbf{U} - \mathbf{H}\mathbf{W} \right\|_F^2 + \lambda \left\| \mathbf{W} \right\|_{1/2} + \gamma\, \mathrm{Tr}\!\left(\mathbf{W}\mathbf{L}\mathbf{W}^\top\right), \quad (17)$$
where $\gamma$ is a weight associated with the manifold regularization, $\mathrm{Tr}(\cdot)$ is the trace of a matrix, $\mathbf{A}$ denotes a heat kernel weighting between columns of $\mathbf{U}$, $\mathbf{D}$ denotes the diagonal matrix with $D_{ll} = \sum_{j} A_{lj}$, and $\mathbf{L} = \mathbf{D} - \mathbf{A}$ is the graph Laplacian.
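The graph quantities used by this regularizer can be sketched as follows. This is a minimal NumPy sketch over a dense, fully connected heat-kernel graph; in practice a k-nearest-neighbor graph is common, and the kernel width `sigma` is a free parameter:

```python
import numpy as np

def graph_laplacian(U, sigma=1.0):
    """Build a heat-kernel affinity A, degree matrix D, and graph Laplacian
    L = D - A over the columns of the feature matrix U."""
    # pairwise squared Euclidean distances between columns of U
    sq = ((U[:, :, None] - U[:, None, :]) ** 2).sum(axis=0)
    A = np.exp(-sq / sigma)
    np.fill_diagonal(A, 0.0)          # no self-loops
    D = np.diag(A.sum(axis=1))
    return A, D, D - A
```

By construction `L` is symmetric and positive semidefinite with zero row sums, which is what makes the trace term in Eq. (17) penalize weighting maps that vary sharply between nearby feature columns.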
III-D4 Minimization
Because the cost function in Eq. (17) is non-convex, a multiplicative iterative method is adopted, akin to that used in [43]. Let $\boldsymbol{\Psi}$ and $\boldsymbol{\Phi}$ be the Lagrange multipliers for the constraints $\mathbf{H} \geq 0$ and $\mathbf{W} \geq 0$, respectively. Using the definition of the Frobenius norm, $\|\mathbf{M}\|_F^2 = \mathrm{Tr}(\mathbf{M}\mathbf{M}^\top)$, and matrix algebra, the Lagrangian is given by
$$\mathcal{L} = \mathrm{Tr}\!\left(\mathbf{U}\mathbf{U}^\top\right) - 2\,\mathrm{Tr}\!\left(\mathbf{U}\mathbf{W}^\top\mathbf{H}^\top\right) + \mathrm{Tr}\!\left(\mathbf{H}\mathbf{W}\mathbf{W}^\top\mathbf{H}^\top\right) + \lambda \left\|\mathbf{W}\right\|_{1/2} + \gamma\,\mathrm{Tr}\!\left(\mathbf{W}\mathbf{L}\mathbf{W}^\top\right) + \mathrm{Tr}\!\left(\boldsymbol{\Psi}\mathbf{H}^\top\right) + \mathrm{Tr}\!\left(\boldsymbol{\Phi}\mathbf{W}^\top\right). \quad (18)$$
The partial derivatives of $\mathcal{L}$ with respect to $\mathbf{H}$ and $\mathbf{W}$ are
$$\frac{\partial \mathcal{L}}{\partial \mathbf{H}} = -2\,\mathbf{U}\mathbf{W}^\top + 2\,\mathbf{H}\mathbf{W}\mathbf{W}^\top + \boldsymbol{\Psi}, \quad (19)$$
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = -2\,\mathbf{H}^\top\mathbf{U} + 2\,\mathbf{H}^\top\mathbf{H}\mathbf{W} + \frac{\lambda}{2}\,\mathbf{W}^{-1/2} + 2\gamma\,\mathbf{W}\mathbf{L} + \boldsymbol{\Phi}, \quad (20)$$
where $\mathbf{W}^{-1/2}$ is taken element-wise. Finally, the update rules are found using the Karush-Kuhn-Tucker conditions, i.e., $\boldsymbol{\Psi} \odot \mathbf{H} = 0$ and $\boldsymbol{\Phi} \odot \mathbf{W} = 0$:
$$\mathbf{H} \leftarrow \mathbf{H} \odot \frac{\mathbf{U}\mathbf{W}^\top}{\mathbf{H}\mathbf{W}\mathbf{W}^\top}, \quad (21)$$
$$\mathbf{W} \leftarrow \mathbf{W} \odot \frac{\mathbf{H}^\top\mathbf{U} + \gamma\,\mathbf{W}\mathbf{A}}{\mathbf{H}^\top\mathbf{H}\mathbf{W} + \frac{\lambda}{4}\,\mathbf{W}^{-1/2} + \gamma\,\mathbf{W}\mathbf{D}}. \quad (22)$$
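A minimal NumPy sketch of the resulting update loop is below. This is our own simplification, not the authors' code: the floor that keeps `W` strictly positive (so the element-wise `W ** -0.5` stays finite) is an implementation convenience we add:

```python
import numpy as np

def gsnmf(U, k, A, lam=0.1, gam=0.1, iters=300, seed=0):
    """Graph-regularized sparse NMF: multiplicative updates with an L1/2
    sparsity penalty (weight lam) on W and a graph term (weight gam) built
    from the affinity matrix A over the columns of U."""
    rng = np.random.default_rng(seed)
    m, n = U.shape
    D = np.diag(A.sum(axis=1))        # degree matrix of the affinity graph
    H = rng.random((m, k)) + 0.1
    W = rng.random((k, n)) + 0.1
    eps = 1e-12
    for _ in range(iters):
        H *= (U @ W.T) / (H @ W @ W.T + eps)
        num = H.T @ U + gam * W @ A
        den = H.T @ H @ W + 0.25 * lam * W ** -0.5 + gam * W @ D + eps
        W = np.maximum(W * num / den, 1e-10)   # keep W > 0 for W**-0.5
    return H, W
```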
III-E Spectral Clustering
Having obtained the building blocks and their associated weighting map via Eq. (22), we apply the widely used spectral clustering to the weighting map, which provides a measure of tissue-point similarity. Spectral clustering has been reported to outperform conventional clustering methods such as the K-means algorithm [44]. We use spectral clustering to obtain the final clustering results, which in turn reveal the cohesive motion patterns; of note, alternative clustering algorithms could be applied in this step. More specifically, from the weighting map $\mathbf{W}$ in Eq. (22), an affinity matrix $\mathbf{S}$ is constructed:
$$S_{jl} = \exp\!\left(-\frac{\left\|\mathbf{w}_j - \mathbf{w}_l\right\|^2}{\sigma}\right), \quad (23)$$
where $\mathbf{w}_j$ represents the $j$-th column vector of $\mathbf{W}$ and $\sigma$ is the scale. Within the graph cut framework, the nodes of the graph are formed by the column vectors of $\mathbf{W}$, and the edge weights indicate the similarity calculated between those column vectors. On the affinity matrix, we apply spectral clustering using a normalized cut algorithm [45]. From a graph cut point of view, our method can be seen as identifying subgraphs representing characteristic motions that are different from one another.
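The spectral clustering step can be sketched as follows. This is a standard normalized-spectral-clustering recipe (symmetric normalized Laplacian embedding followed by k-means on the rows), not necessarily the exact normalized cut variant of [45]:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(A, k, seed=0):
    """Spectral clustering on an affinity matrix A: embed the points with the
    k eigenvectors of the symmetric normalized Laplacian that have the
    smallest eigenvalues, normalize the rows, then cluster with k-means."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d + 1e-12)
    L_sym = np.eye(len(A)) - (d_inv_sqrt[:, None] * A) * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)            # eigenvalues in ascending order
    X = vecs[:, :k]                            # k smallest eigenvectors
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    _, labels = kmeans2(X, k, minit='++', seed=seed)
    return labels
```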
III-F Model Selection
To achieve the best clustering quality, we need to select the optimal number of clusters, which is a challenging task. Previous research on coarticulation examined a small number of tongue regions to assess the degrees of freedom of the vocal tract, ranging from two units (i.e., tip vs. body) [54] to five units (i.e., front-to-back segments for the genioglossus, verticalis, and transverse muscles) [3]. In addition to this empirical knowledge about tongue anatomy and physiology, we further use the consensus clustering approach introduced by Monti et al. [46] to estimate the optimal number of clusters within the NMF framework by examining the stability of the identified clusters. In brief, NMF may yield different decomposition results depending on the initialization [47, 51]. The consensus clustering approach is therefore based on the assumption that sample assignments to clusters should not change across different runs. A connectivity matrix $\mathbf{C}$ is constructed with entry $C_{jl} = 1$ if samples $j$ and $l$ are clustered together, and $C_{jl} = 0$ if they are clustered differently. The consensus matrix $\bar{\mathbf{C}}$ is then constructed by taking the average over the connectivity matrices generated by different initializations. In this work, we selected the number of runs (30) based on the stability of the consensus matrix. Finally, the dispersion coefficient $\rho$ of the consensus matrix is defined as
$$\rho = \frac{1}{n^2} \sum_{j=1}^{n} \sum_{l=1}^{n} 4\left(\bar{C}_{jl} - \frac{1}{2}\right)^2, \quad (24)$$
where $\bar{C}_{jl}$ and $n$ denote each entry of the consensus matrix and its row (and column) size, respectively. Since the entries of the consensus matrix represent the probability that samples $j$ and $l$ are clustered together, the dispersion coefficient, ranging between 0 and 1, represents the reproducibility of the cluster assignments across different initializations. The optimal number of clusters is then chosen as the one that yields the maximal value of $\rho$.

TABLE I: Clustering accuracy (AC, %) on the 2D datasets.

AC (%)         | K-means | NCut  | GSNMF-K | Our method
COIL20 (k=20)  | 60.48   | 66.52 | 83.75   | 85.00
PIE (k=68)     | 23.91   | 65.91 | 79.90   | 84.13
Tongue (k=7)   | 98.41   | 99.71 | 98.51   | 99.92
Tongue (k=8)   | 98.72   | 99.85 | 99.14   | 99.89
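The consensus-clustering machinery of Section III-F reduces to two small functions, sketched below; the toy label vectors are hypothetical:

```python
import numpy as np

def connectivity(labels):
    """C[j, l] = 1 if samples j and l share a cluster in this run, else 0."""
    l = np.asarray(labels)
    return (l[:, None] == l[None, :]).astype(float)

def dispersion(consensus):
    """Dispersion coefficient: 1 for perfectly reproducible clusterings
    (all consensus entries 0 or 1), smaller when entries hover near 0.5."""
    n = consensus.shape[0]
    return float(np.sum(4.0 * (consensus - 0.5) ** 2) / (n * n))

# Consensus matrix = average of connectivity matrices over runs with
# different initializations; these three runs induce the same partition
# (label names differ, the grouping does not), so dispersion is 1.0.
runs = [np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0]), np.array([0, 0, 1, 1])]
C = np.mean([connectivity(r) for r in runs], axis=0)
rho = dispersion(C)
```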
IV Experimental Results
In this section, we describe the qualitative and quantitative evaluations used to validate the proposed method. We used 2D image and tongue data, as well as 3D simulated tongue motion data, to demonstrate the accuracy of our approach. Furthermore, 3D in vivo human tongue motion datasets were used to identify functional units during protrusion and simple speech tasks. All experiments were performed on an Intel Core i5 with a clock speed of 3.1 GHz and 16 GB memory. The mean computation times of the clustering algorithm for "a souk," "a geese," and the protrusion task were 18.3, 21.2, and 10.4 seconds, respectively.
IV-A Experiments Using 2D Image and Synthetic Motion Data
We used three 2D datasets to validate our method. The first two are the COIL20 image database and the CMU PIE database, with 20 and 68 labels, respectively. As our third dataset, we used 2D synthetic tongue motion data containing a displacement field from two time frames, as shown in Figure 2, where we tested 7 and 8 labels. Our method was compared against K-means clustering, a normalized cut method (NCut) [45], and graph-regularized sparse NMF with K-means clustering (GSNMF-K) [25], given the known ground truth and number of labels. To measure accuracy, the clustering accuracy (AC) was used, similar to [25]. Table I lists the accuracy measures, showing that our method performs better than the other methods. In addition, the $L_{1/2}$ and $L_1$ norms were compared experimentally, demonstrating that the $L_{1/2}$ norm performed better. In our experiments, we chose the parameters $(\lambda, \gamma, \sigma)$ as (70, 100, 0.01) for the COIL dataset, (160, 100, 0.01) for the PIE dataset, and (100, 800, 0.06) for the tongue datasets; the parameters were chosen empirically to yield the maximal performance.
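The clustering accuracy (AC) metric can be sketched as follows, using its usual definition: find the best one-to-one mapping between predicted and ground-truth labels (Hungarian algorithm), then count correctly assigned samples. We assume this matches the AC used in [25]:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true, pred):
    """AC: fraction of samples correct under the best one-to-one relabeling
    of predicted clusters onto ground-truth clusters."""
    true, pred = np.asarray(true), np.asarray(pred)
    k = int(max(true.max(), pred.max())) + 1
    cost = np.zeros((k, k))
    for t, p in zip(true, pred):
        cost[p, t] -= 1            # negative counts: minimizing maximizes overlap
    rows, cols = linear_sum_assignment(cost)
    mapping = dict(zip(rows, cols))
    return float(np.mean([mapping[p] == t for t, p in zip(true, pred)]))
```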
TABLE II: Clustering accuracy (AC, %) on the 3D synthetic tongue motion datasets.

AC (%)   | K-means | NCut  | GSNMF-K | Our method
A (k=2)  | 91.72   | 75.84 | 75.64   | 94.79
B (k=3)  | 94.83   | 95.51 | 96.16   | 96.44
C (k=2)  | 87.91   | 100   | 86.43   | 100
D (k=3)  | 81.24   | 99.99 | 85.31   | 99.99
IV-B Experiments Using 3D Synthetic Tongue Motion Data
We also assessed the performance of the proposed method using four synthetic tongue motion datasets, because no ground truth is available for our in vivo tongue motion data. We used a composite Lagrangian displacement field of individual muscle groups based on a tongue atlas [13]. Each muscle group was defined by a mask volume with a value of 1 inside the muscle group and 0 elsewhere; since the masks were known, they also provided ground-truth labels for examining the accuracy of our method's output. The first and second datasets used the genioglossus (GG) and superior longitudinal (SL) muscles, which interdigitate with each other. In the first dataset, the GG muscle was translated while the SL muscle was rotated about one of the cardinal directions over the course of 11 time frames; the interdigitated part was included in the GG muscle, and the clustering outcome (i.e., 2 clusters) using our method is displayed in Fig. 3(A). In the second dataset, the same motion as in the first dataset was used, and the clustering outcome (i.e., 3 clusters) is displayed in Fig. 3(B). The third and fourth datasets used the GG and transverse muscles, which also interdigitate with each other. In the third dataset, the GG muscle was rotated while the transverse muscle was translated over the course of 11 time frames; the clustering outcome (i.e., 2 clusters) is displayed in Fig. 3(C), with the interdigitated part included in the GG muscle. In the fourth dataset, the GG muscle was translated while the transverse muscle was rotated over the course of 11 time frames; the clustering outcome (i.e., 3 clusters) is displayed in Fig. 3(D). Table II lists the AC measures, demonstrating that our method outperforms the other methods. In our experiments, we chose $(\lambda, \gamma, \sigma)$ = (100, 100, 0.01) for datasets A and C, and (100, 100, 0.05) for datasets B and D.
IV-C Experiments Using 3D In Vivo Tongue Motion Data
TABLE III: Characteristics of the in vivo tongue motion data analyzed.

Task               | Protrusion | /s/ to /u/ | /i/ to /s/
Time frames        | 1-13       | 10-17      | 15-26
Number of clusters | 2-5        | 2-5        | 2-5
3D in vivo tongue motion data covering protrusion and simple speech tasks, such as "a souk" and "a geese," were used to identify functional units. One subject performed the protrusion task synchronized with a metronome, and ten subjects performed the "a souk" and "a geese" tasks using the same protocol. We used the same motion quantities described above as input to the sparse NMF framework. Table III lists the characteristics of our in vivo tongue motion data, including the time frames analyzed and the number of clusters extracted based on the dispersion coefficient. We used $(\lambda, \gamma, \sigma)$ = (100, 100, 0.01) for the 3D in vivo tongue datasets. Of note, we focus on the 2-4 detected functional units in the interpretation and analysis below.
TABLE IV: Sizes (%) of the detected functional units during /s/ to /u/ in "a souk."

           | 2 units            | 3 units                      | 4 units
Subject 1  | 47.6, 52.4         | 43.4, 19.4, 37.2             | 25.5, 33.8, 19.1, 21.6
Subject 2  | 46.2, 53.8         | 30.3, 18.8, 50.9             | 21.5, 36.2, 25.3, 17.0
Subject 3  | 36.3, 63.7         | 28.1, 37.5, 34.4             | 25.1, 32.1, 19.7, 23.1
Subject 4  | 46.6, 53.4         | 33.3, 31.9, 34.8             | 29.0, 24.5, 22.8, 23.7
Subject 5  | 49.3, 50.7         | 28.3, 41.6, 30.1             | 25.4, 27.8, 22.0, 24.8
Subject 6  | 32.1, 67.9         | 37.2, 22.6, 40.1             | 22.0, 26.5, 20.2, 31.3
Subject 7  | 35.1, 64.9         | 37.6, 29.6, 32.8             | 25.2, 31.0, 20.4, 23.4
Subject 8  | 36.7, 63.3         | 18.8, 46.8, 34.3             | 33.5, 21.4, 15.3, 29.7
Subject 9  | 42.1, 57.9         | 26.3, 35.0, 38.8             | 25.1, 24.5, 18.3, 32.1
Subject 10 | 43.4, 56.6         | 47.2, 20.4, 32.4             | 28.4, 27.6, 25.1, 18.9
Mean±SD    | 41.5±6.1, 58.5±6.1 | 33.0±8.5, 30.4±9.9, 36.6±5.9 | 26.1±3.5, 28.5±4.6, 20.8±3.1, 24.6±5.1
TABLE V: Sizes (%) of the detected functional units during /i/ to /s/ in "a geese."

           | 2 units            | 3 units                      | 4 units
Subject 1  | 44.2, 55.8         | 38.2, 28.2, 33.6             | 34.0, 21.8, 25.2, 19.0
Subject 2  | 50.1, 49.9         | 33.2, 33.0, 33.8             | 31.5, 19.9, 26.6, 22.0
Subject 3  | 39.9, 60.1         | 40.3, 24.4, 24.4             | 29.3, 28.2, 25.2, 17.3
Subject 4  | 52.7, 47.3         | 39.3, 30.4, 30.2             | 23.8, 23.3, 25.5, 27.4
Subject 5  | 45.7, 54.3         | 38.5, 38.5, 23.0             | 33.5, 25.7, 24.2, 16.6
Subject 6  | 72.7, 27.3         | 44.0, 23.9, 32.1             | 34.4, 14.8, 27.7, 23.1
Subject 7  | 50.0, 50.0         | 48.0, 20.2, 31.8             | 25.2, 32.0, 26.1, 16.7
Subject 8  | 51.1, 48.9         | 37.4, 35.3, 27.3             | 18.7, 36.2, 24.5, 20.6
Subject 9  | 50.8, 49.2         | 43.6, 35.0, 21.4             | 31.0, 25.6, 24.8, 18.6
Subject 10 | 50.1, 49.9         | 36.7, 33.6, 29.7             | 23.1, 32.7, 23.8, 20.4
Mean±SD    | 50.8±9.2, 49.2±9.2 | 40.3±4.4, 29.9±6.1, 28.6±4.7 | 29.1±5.4, 25.3±6.4, 25.5±1.1, 20.1±3.6
First, we identified the functional units for the protrusion task; Fig. 4 shows (B) two clusters, (C) three clusters, and (D) four clusters, based on the dispersion coefficient depicted in Fig. 5 and on visual assessment. The optimal number of clusters was found to be 3 for this dataset, as shown in Fig. 5. The protrusion motion is characterized by forward and upward motion of the tongue, as depicted in Fig. 4(A). Fig. 4(B) shows forward protrusion (red) vs. small motion (blue). In addition, as the number of clusters increases, as shown in Fig. 4(C) and (D), the large region with small motion (blue, Fig. 4(B)) is subdivided into smaller functional units.
Second, we identified the functional units during /s/ to /u/ in "a souk" for subject 9, as shown in Table IV. Fig. 6 depicts (B) two clusters, (C) three clusters, and (D) four clusters in the same manner. The optimal number of clusters was found to be 2 for this dataset, as shown in Fig. 7. Table IV lists the sizes of the detected functional units as percentages; the detected sizes differed from subject to subject, and larger variability was observed for the three detected units. During the course of these motions, the tongue tip moves forward then upward/backward, while the tongue body and the posterior tongue move upward and forward, respectively, as depicted in Fig. 6(A). Fig. 6(B) shows two clusters, the tip plus bottom of the tongue (red) vs. the tongue body. The three clusters in Fig. 6(C) give a good representation of the tip plus bottom, body, and posterior of the tongue, and the four clusters in Fig. 6(D) further subdivide the tongue tip and bottom.
Third, in the same manner, we identified the functional units during /i/ to /s/ in "a geese" for subject 8, as shown in Table V. Fig. 8 displays (B) two clusters, (C) three clusters, and (D) four clusters. The optimal number of clusters was found to be 3 for this dataset, as shown in Fig. 9. Table V lists the sizes of the detected functional units as percentages; the detected sizes differed from subject to subject, and larger variability was observed for the two detected units. During the course of these motions, the tongue tip moves upward, while the tongue body moves backward and the posterior tongue moves forward, as shown in Fig. 8(A). The two clusters in Fig. 8(B) show a division between the tip plus bottom of the tongue (red) and the tongue body (blue). The three clusters in Fig. 8(C) give a good representation of the tip plus body, dorsum, and posterior of the tongue, and the four clusters depicted in Fig. 8(D) further subdivide the posterior of the tongue.
V Discussion
In this work, we presented an approach for characterizing the tongue's multiple functional degrees of freedom, which is critical for understanding the tongue's role in speech and other lingual behaviors, because the functioning of the tongue in these behaviors entails successful orchestration of a complex system of 3D tongue muscular structures over time. Determining functional units in healthy controls plays an important role in understanding motor control strategy, which in turn could elucidate pathological or adapted motor control strategies when analyzing data from patients, such as those with tongue cancer or amyotrophic lateral sclerosis. This has been a long-sought goal that many researchers have pursued using various techniques.
Utilizing recent advances in tMRI-based motion tracking and data mining, we presented a new method for identifying functional units from MRI, which opens new windows for a better understanding of underlying tongue behaviors and for studying speech production. Unsupervised data clustering using sparse NMF is the task of identifying semantically meaningful clusters using a low-dimensional representation of a dataset. Two constraints beyond standard NMF were employed to reflect the physiological properties of 4D tongue motion during protrusion and simple speech tasks. First, the sparsity constraint was introduced to capture the simplest and most optimized weighting map; sparsity has been one of the important properties in phonological theories [20], and our work attempted to decode this phenomenon within a sparse NMF framework. Second, the manifold regularization was added to capture the intrinsic geometric relationships between motion features; it also preserves the geometric structure between motion features, which is particularly important when dealing with tongue motions that lie on low-dimensional, non-Euclidean manifolds.
NMF is favored in this work over alternative matrix decomposition techniques, such as principal component analysis (PCA), independent component analysis (ICA), factor analysis, and sparse coding, for identifying functional units of tongue motion. While PCA, for example, is appropriate for analyzing kinematic or biomechanical features that exhibit both positive and negative values without an explicit assumption of underlying muscle activity, NMF is well-suited to analyzing signals resulting from muscle activity, which are inherently nonnegative [53]. This seemingly small nonnegativity constraint makes a significant difference between PCA and NMF. First, while in PCA the building blocks are orthogonal to each other, in NMF the building blocks are independent, meaning that no building block can be constructed as a linear combination of the others. Second, in NMF, the building blocks are likely to specify the boundaries (or edges) of the features, thereby constructing a convex hull within which all the features can be found [53]. The ability of NMF to define the boundaries of the features specifies a subspace in which all possible combinations of functional units lie, thus yielding physiologically interpretable results. Third, NMF learns a parts-based representation in the sense that a set of parts, when summed, constitutes the whole. In contrast, in PCA, the elements of the building blocks and their weighting map can cancel each other out, obliterating the building blocks’ physical meaningfulness.
In the experiments using 2D synthetic data, the purpose was to assess the accuracy of the clustering performance given known cluster labels in an unsupervised learning setting, as there is no ground truth in the
in vivo tongue motion data. In the experiments using 3D tongue motion simulation data, the testing motions in Fig. 3(A) and (C) included rotations about the styloid process and displacements in the inferior-superior direction, which is consistent with in vivo tongue movements. As for the input features, we used the magnitude and angle of point tracks derived from tMRI. More features could be investigated, such as those reflecting mechanical properties, including principal strains, curvature, minimum-jerk, or the two-thirds power law [55], or motion descriptors combining these individual features.
There are a few model selection methods available, ranging from heuristic methods to sophisticated probabilistic methods
[56, 57] for different clustering methods. In this work, we chose the number of clusters based on previous research on lingual coarticulation, which divided the tongue into a small set of regions, between two (i.e., tip and body) [54] and five units (i.e., front-to-back segments for the genioglossus, verticalis, and transverse) [3]. Additionally, we built a “consensus matrix” from multiple runs for each candidate number of clusters and assessed the presence of block structure. As an alternative, one can compare reconstruction errors for different numbers of clusters or examine the stability (i.e., the agreement between results) of multiple randomly initialized runs for each candidate number. Because there is no ground truth in our tongue data, we used both visual assessment and a model selection approach in which previous research on coarticulation provided an upper limit on the number of clusters. A thorough analysis of the optimal number of clusters is a subject for future research.
There are a few directions in which to improve the current work. First, we used a data-driven approach to determine the functional units, which was visually assessed because of the lack of ground truth. This could be improved by further studies using model-based approaches via biomechanical simulations [48] or electromyography [49] to co-validate our findings. For biomechanical simulations, subject-specific anatomy and the associated weighting map could serve as input, and inverse simulation could then be used to verify the validity of the obtained weighting map. Second, we used the magnitude and orientation of point tracks as our input features. To weight each input feature equally, we normalized the feature values to the same range. In future work, we will further study automatic relevance determination methods to model the interactions among these features to yield the best clustering outcome.
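The consensus-matrix assessment mentioned above can be sketched as follows. Here `nmf_fn` is a placeholder with a hypothetical signature `nmf_fn(X, k, seed) -> (W, H)` standing in for any randomly initialized NMF routine, and each sample is assigned to the cluster with the largest weight in its column of H; entries of the consensus matrix near 0 or 1 (i.e., clear block structure) indicate a stable choice of the number of clusters.

```python
import numpy as np

def consensus_matrix(X, k, nmf_fn, n_runs=20):
    """Build a consensus matrix over multiple randomly initialized NMF runs.

    Entry (p, q) is the fraction of runs in which samples p and q were
    assigned to the same cluster, where a run assigns sample j to
    argmax_i H[i, j]. A matrix close to block-diagonal (after reordering)
    suggests the clustering is stable for this k.
    """
    n = X.shape[1]
    C = np.zeros((n, n))
    for seed in range(n_runs):
        _, H = nmf_fn(X, k, seed)          # one randomly initialized run
        labels = H.argmax(axis=0)          # hard cluster assignment per sample
        C += (labels[:, None] == labels[None, :]).astype(float)
    return C / n_runs
```

Repeating this over a range of candidate k values and inspecting (or summarizing) the resulting consensus matrices is one simple, assumption-light way to compare cluster numbers when no ground truth is available.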
Finally, the identified functional units, as shown in our experimental results, may involve multiple regions that correspond to a sub-muscle or multiple muscles. Therefore, we will further perform registration [59] of the identified functional units with the muscular anatomy from individual high-resolution MRI, diffusion MRI [60], or a vocal tract atlas [13], and study the relationship between the functional units and the underlying muscular anatomy.
VI Conclusion
We have presented a new algorithm to determine local functional units that link muscle activity to surface tongue geometry during protrusion and simple speech tasks. We evaluated the performance of our approach using various simulated and human tongue motion datasets, demonstrating that our method surpassed conventional methods and accurately clustered human tongue motion. Our results suggest that it is feasible to identify functional units using a set of motion features, which has great potential for improving diagnosis, treatment, and rehabilitation in patients with speech-related disorders.
Acknowledgment
This research was supported in part by National Institutes of Health Grants R21DC016047, R00DC012575, R01DC014717, R01CA133015, and P41EB022544. The authors would like to thank Drs. Joseph Perkell and Sidney Fels for their valuable insights and helpful discussions. The authors also thank Sung Ahn for proofreading the text.
References
 [1] D. D. Lee, H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, 401(6755), pp. 788–791, 1999
 [2] G. Trigeorgis, K. Bousmalis, S. Zafeiriou, B. W. Schuller, “A deep matrix factorization method for learning attribute representations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(3), pp. 417–429, 2017
 [3] M. Stone, M. A. Epstein, K. Iskarous, “Functional segments in tongue movement,” Clinical Linguistics & Phonetics, 18(6–8), pp. 507–521, 2004
 [4] E. Bizzi, V. Cheung, A. d’Avella, P. Saltiel, M. Tresch, “Combining modules for movement,” Brain Research Reviews, 57, pp. 125–133, 2008
 [5] W. M. Kier, K. K. Smith, “Tongues, tentacles and trunks: the biomechanics of movement in muscular-hydrostats,” Zoological Journal of the Linnean Society, 83(4), pp. 307–324, 1985
 [6] J. R. O’Kusky and M. G. Norman, “Sudden infant death syndrome: increased number of synapses in the hypoglossal nucleus,” Journal of Neuropathology and Experimental Neurology, 54, pp. 627–634, 1995
 [7] V. Parthasarathy, J. L. Prince, M. Stone, E. Z. Murano, M. NessAiver, “Measuring tongue motion from tagged cineMRI using harmonic phase (HARP) processing,” Journal of the Acoustical Society of America, 121(1), pp. 491–504, January, 2007
 [8] J. R. Green and Y.-T. Wang, “Tongue-surface movement patterns during speech and swallowing,” Journal of the Acoustical Society of America, 113(5), pp. 2820–2833, 2003
 [9] J. A. S. Kelso, “Synergies: Atoms of Brain and Behavior,” In Progress in Motor Control, pp. 83–91, Springer, 2009
 [10] M. Stone, E. P. Davis, A. S. Douglas, M. N. Aiver, R. Gullapalli, W. S. Levine, A. J. Lundberg, “Modeling tongue surface contours from Cine-MRI images,” Journal of Speech, Language, and Hearing Research, 44(5), pp. 1026–1040, 2001
 [11] E. Bresch, Y. C. Kim, K. Nayak, D. Byrd, S. Narayanan, “Seeing speech: Capturing vocal tract shaping using realtime magnetic resonance imaging,” IEEE Signal Processing Magazine, 25(3), pp. 123–132, 2008
 [12] M. Fu, B. Zhao, C. Carignan, R. K. Shosted, J. L. Perry, D. P. Kuehn, Z.-P. Liang, and B. P. Sutton, “High-resolution dynamic speech imaging with joint low-rank and sparsity constraints,” Magnetic Resonance in Medicine, 73(5), pp. 1820–1832, 2015
 [13] J. Woo, J. Lee, E. Murano, F. Xing, M. Al-Talib, M. Stone and J. Prince, “A high-resolution atlas and statistical model of the vocal tract from structural MRI,” Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, pp. 1–14, 2015
 [14] M. Stone, J. Woo, J. Lee, T. Poole, A. Seagraves, M. Chung, E. Kim, E. Z. Murano, J. L. Prince, S. S. Blemker, “Structure and variability in human tongue muscle anatomy,” Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, pp. 1–9, 2016
 [15] F. Xing, J. Woo, E. Z. Murano, J. Lee, M. Stone, J. L. Prince, “3D tongue motion from tagged and cine MR images,” International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), LNCS 8151, pp. 41–48, 2013
 [16] B. Gick and I. Stavness, “Modularizing speech,” Frontiers in psychology, 4, pp. 1–3, 2013
 [17] M. Stone, “A three-dimensional model of tongue movement based on ultrasound and x-ray microbeam data,” Journal of the Acoustical Society of America, 87, pp. 2207–2217, 1990
 [18] V. Ramanarayanan, L. Goldstein, and S. Narayanan, “Spatiotemporal articulatory movement primitives during speech production: extraction, interpretation, and validation,” Journal of the Acoustical Society of America, 134(2), pp. 1378–1394, 2013
 [19] J. Woo, F. Xing, J. Lee, M. Stone, J. L. Prince, “Determining Functional Units of Tongue Motion via Graph-regularized Sparse Non-negative Matrix Factorization,” International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 17(Pt 2), pp. 146–153, 2014
 [20] C. Browman and L. Goldstein, “Dynamics and articulatory phonology,” in Mind as Motion: Explorations in the Dynamics of Cognition, edited by R. F. Port and T. van Gelder (MIT Press, Cambridge, MA), pp. 175–194, 1995
 [21] T. Li and C. H. Q. Ding, “Nonnegative Matrix Factorizations for Clustering: A Survey,” In: Data Clustering: Algorithms and Applications. Chapman and Hall/CRC, pp. 149–176, 2013
 [22] N. F. Osman, W. S. Kerwin, E. R. McVeigh, J. L. Prince, “Cardiac motion tracking using CINE harmonic phase (HARP) magnetic resonance imaging,” Magnetic Resonance in Medicine, 42(6), pp. 1048–1060, 1999
 [23] T. Mansi, X. Pennec, M. Sermesant, H. Delingette, and N. Ayache, “iLogDemons: A demonsbased registration algorithm for tracking incompressible elastic biological tissues,” International journal of computer vision, 92(1), pp. 92–111, 2011
 [24] F. Xing, C. Ye, J. Woo, M. Stone, J. L. Prince, “Relating speech production to tongue muscle compressions using tagged and highresolution magnetic resonance imaging,” SPIE Medical Imaging, Florida, FL, 2015
 [25] D. Cai, X. He, J. Han, T. S. Huang, “Graph regularized nonnegative matrix factorization for data representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8), pp. 1548–1560, 2011
 [26] H. Shinagawa, E. Z. Murano, J. Zhuo, B. Landman, R. Gullapalli, J. L. Prince, and M. Stone, “Tongue muscle fiber tracking during rest and tongue protrusion with oral appliances: A preliminary study with diffusion tensor imaging,” Acoustical Science and Technology, 29(4), pp. 291–294, 2008
 [27] T. A. Gaige, T. Benner, R. Wang, V. J. Wedeen, R. J. Gilbert, “Three dimensional myoarchitecture of the human tongue determined in vivo by diffusion tensor imaging with tractography,” Journal of Magnetic Resonance Imaging, 26(3), pp. 654–661, 2007
 [28] M. NessAiver and J. L. Prince, “Magnitude image CSPAMM reconstruction (MICSR),” Magnetic Resonance in Medicine, 50(2), pp. 331–342, 2003
 [29] J. Woo, M. Stone, Y. Suo, E. Murano, J. L. Prince, “TissuePoint Motion Tracking in the Tongue From Cine MRI and Tagged MRI,” Journal of Speech, Language, and Hearing Research, 57(2), pp. 626–636, 2013
 [30] J. Kim and H. Park, “Sparse nonnegative matrix factorization for clustering,” Technical Report GTCSE0801, Georgia Institute of Technology, 2008
 [31] F. Shahnaz, M. W. Berry, V. P. Pauca, and R. J. Plemmons, “Document clustering using nonnegative matrix factorization,” Information Processing and Management, 42(2), pp. 373–386, 2006
 [32] Y. Gao, G. Church, “Improving molecular cancer class discovery through sparse nonnegative matrix factorization,” Bioinformatics, 21(21), pp. 3970–3975, 2005
 [33] J. Wang, X. Wang, X. Gao, “Nonnegative matrix factorization by maximizing correntropy for cancer clustering,” BMC Bioinformatics, 14(1), pp. 1–11, 2013
 [34] A. Anderson, P. K. Douglas, W. T., Kerr, V. S. Haynes, A. L. Yuille, J. Xie, Y. N. Wu, J. A. Brown, M. S. Cohen, “Nonnegative matrix factorization of multimodal MRI, fMRI and phenotypic data reveals differential changes in default mode subnetworks in ADHD,” NeuroImage, 102(1), pp. 207–219, 2014
 [35] Q. Mo and B. A. Draper, “Semi-nonnegative matrix factorization for motion segmentation with missing data,” European Conference on Computer Vision, pp. 402–415, Berlin, Heidelberg, 2012
 [36] A. Cheriyadat and R. J. Radke, “Nonnegative matrix factorization of partial track data for motion segmentation,” IEEE International Conference in Computer Vision, pp. 865–872, 2009
 [37] F. Xing, J. Woo, A. Gomez, D. L. Pham, P. V. Bayly, M. Stone, and J. L. Prince, “Phase Vector Incompressible Registration Algorithm (PVIRA) for Motion Estimation from Tagged Magnetic Resonance Images,” IEEE Transactions on Medical Imaging, 36(10), pp. 2116–2128, Oct, 2017
 [38] F. Xing, J. Woo, E. Z. Murano, J. Lee, M. Stone, J. L. Prince, “3D tongue motion from tagged and cine MR images,” International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), Nagoya, Japan, pp. 865–872, Sep, 2013
 [39] A. Elgammal and C. S. Lee, “The role of manifold learning in human motion analysis,” In B. Rosenhahn, R. Klette, D. Metaxas, (eds.) Human Motion. Computational Imaging and Vision, 36, pp. 25–56, 2008
 [40] Y. Qian, S. Jia, J. Zhou, A. Robles-Kelly, “Hyperspectral unmixing via L1/2 sparsity-constrained nonnegative matrix factorization,” IEEE Transactions on Geoscience and Remote Sensing, 49(11), pp. 4282–4297, 2011
 [41] S. Nithiananthan, K. K. Brock, M. J. Daly, H. Chan, J. C. Irish, and J. H. Siewerdsen, “Demons deformable registration for CBCT‐guided procedures in the head and neck: Convergence and accuracy,” Medical physics, 36(10), pp. 4755–4764, 2010
 [42] S. Yang, C. Hou, C. Zhang, Y. Wu, “Robust non-negative matrix factorization via joint sparse and graph regularization for transfer learning,” Neural Computing and Applications, 23(2), pp. 541–559, 2013
 [43] X. Lu, H. Wu, Y. Yuan, P. Yan, X. Li, “Manifold regularized sparse NMF for hyperspectral unmixing,” IEEE Transactions on Geoscience and Remote Sensing, 51(5), pp. 2815–2826, 2013
 [44] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, 17(4), pp. 395–416, 2007
 [45] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), pp. 888–905, 2000
 [46] S. Monti, P. Tamayo, J. Mesirov, T. Golub, “Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data,” Machine Learning, 52(1–2), pp. 91–118, 2003
 [47] J.P. Brunet, P. Tamayo, T. R. Golub, J. P. Mesirov, “Metagenes and molecular pattern discovery using matrix factorization,” Proceedings of the national academy of sciences, 101(12), pp. 4164–4169, 2004
 [48] I. Stavness, J. E. Lloyd and S. Fels, “Automatic prediction of tongue muscle activations using a finite element model,” Journal of Biomechanics, 45(16), pp. 2841–2848, 2012
 [49] L. Pittman and E. F. Bailey, “Genioglossus and intrinsic electromyographic activities in impeded and unimpeded protrusion tasks,” Journal of Neurophysiology, 101(1), pp. 276–282, 2009
 [50] Z.-Y. Zhang, T. Li, C. Ding, X.-W. Ren, and X.-S. Zhang, “Binary matrix factorization for analyzing gene expression data,” Data Mining and Knowledge Discovery, 20(1), pp. 28–52, 2010
 [51] D. Kuang, J. Choo, and H. Park, “Nonnegative matrix factorization for interactive topic modeling and document clustering,” In Partitional Clustering Algorithms, pp. 215–243, Springer, 2015
 [52] C. Boutsidis and E. Gallopoulos, “SVD based initialization: A head start for nonnegative matrix factorization,” Pattern Recognition, 41(4), pp. 1350–1362, 2008
 [53] L. H. Ting, S. A. Chvatal, “Decomposing muscle activity in motor tasks: methods and interpretation,” in Motor Control: Theories, Experiments, and Applications, F. Danion, M. L. Latash, eds., Oxford University Press, New York, 2011
 [54] S. E. G. Ohman, “Numerical model of coarticulation,” Journal of the Acoustical Society of America, 41, pp. 310–320, 1967
 [55] P. Viviani, T. Flash, “Minimumjerk, twothirds power law, and isochrony: converging approaches to movement planning,” Journal of Experimental Psychology: Human Perception and Performance, 21(1), pp. 32–53, 1995
 [56] R. Tibshirani, G. Walther, T. Hastie, “Estimating the number of clusters in a data set via the gap statistic,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), pp.411–423, 2001
 [57] B. Kulis and M. I. Jordan, “Revisiting k-means: New algorithms via Bayesian nonparametrics,” 29th International Conference on Machine Learning (ICML), Edinburgh, Scotland, 2012
 [58] J. Woo, F. Xing, M. Stone, J. R. Green, T. G. Reese, T. J. Brady, V. J. Wedeen, J. L. Prince, and G. El Fakhri, “Speech Map: A Statistical Multimodal Atlas of 4D Tongue Motion During Speech from Tagged and Cine MR Images,” Journal of Computer Methods in Biomechanics and Biomedical Engineering (CMBBE): Special Issue on Imaging and Visualization, 2017
 [59] J. Woo, M. Stone, and J. L. Prince, “Multimodal registration via mutual information incorporating geometric and spatial context,” IEEE Transactions on Image Processing, 24(2), 757–769, 2015
 [60] E. Lee, F. Xing, S. Ahn, T. G. Reese, R. Wang, J. R. Green, N. Atassi, V. J. Wedeen, G. El Fakhri, J. Woo, “Magnetic resonance imaging based anatomical assessment of tongue impairment due to amyotrophic lateral sclerosis: A preliminary study,” Journal of Acoustical Society of America, 143(4), EL248–EL254, 2018