1 Motivation
One of the fundamental questions in medical practice is how diseases progress in individual patients. Accurate continuous monitoring of a patient's condition could considerably improve prevention and treatment. Many medical tests, such as X-ray, MRI, motion capture gait analysis, biopsy, or blood tests, are costly, harmful, or inconvenient to perform frequently. Since increasing the sampling frequency is often not feasible due to inconvenience and costs, practitioners need to reach for statistical tools to analyze the dynamics of these measurements.
While in many situations multiple data points from patients' histories are available, these data are often underutilized. For the sake of simplicity and convenience, many prognostic models applied in practice use only the last few observations from a series, or summary statistics such as the mean over time. However, this simplification ignores important information about the progression, including its dynamics and the individual patient's variability. Moreover, the noise inherent to measurements further hinders the inference of changes in a patient's health. To make use of these data, practitioners need statistical models and software. To enable appropriate usage and broad adoption, these tools should be simple to use and understand.
To illustrate the potential benefits of temporal models in these scenarios, in Figure 1 we present measurements of gait pathology progression. The thick solid line represents the estimated mean and indicates a clear trend of growth during puberty. However, by looking at individual processes and by modeling between-subject similarities, we may model the progression more accurately. This approach could result in personalized predictions, new techniques for clustering patients based on progression patterns, or the extraction of latent representations of progression patterns, which could then be used as predictors in other models.
This kind of data has been commonly analyzed using linear mixed models, where we treat time as a random effect and nonlinearity of this effect is imposed by the choice of the functional basis (Zeger et al., 1988; Verbeke, 1997; McCulloch and Neuhaus, 2001). When data are very sparse, additional constraints on the covariance structure of trajectories are derived using cross-validation or information criteria (Rice and Wu, 2001; Bigelow and Dunson, 2009). To further reduce the dimension, practitioners model the covariance as a low-rank matrix (James et al., 2000; Berkey and Kent, 1983; Yan et al., 2017; Hall et al., 2006; Besse and Ramsay, 1986; Yao and Lee, 2006; Greven et al., 2011). Multiple models were developed for incorporating additional covariates (Song et al., 2002; Liu and Huang, 2009; Rizopoulos et al., 2014). While statisticians use these methods in practice, they need to fine-tune them each time, since the nature of the processes differs in each clinical setting. Moreover, the probabilistic formulation of the model and the dependence on the underlying distributions might hinder applicability or adoption to other practical problems. In this work, we propose a flexible and straightforward statistical framework, exploiting matrix factorization techniques. We focus on the simplicity of the formulation, and we implement software that is easy to use and extend.
2 Background and related work
In many clinical settings, researchers and practitioners model the patient's history as a multivariate process of clinical measurements. Examples include variables such as the size of a tumor, blood markers, and the height and weight of a patient. The multivariate measurements are noisy, and they are observed at different timepoints. This situation arises when, for example, clinicians perform a blood test at every visit but a biopsy only sporadically.
Even though multiple variables are measured, the ones directly related to the disease, such as the size of a tumor, are usually of special interest. Therefore, to focus our attention and aid further discussion, we start by introducing notation and methodology for the univariate case. Let $N$ denote the number of subjects. For each individual $i$, we measure the process at irregularly sampled timepoints $t_{i,1}, \dots, t_{i,n_i}$. We assume that for each $i$ we have $t_{i,j} \in [0, T]$, for some $T > 0$.
Let $y_{i,1}, \dots, y_{i,n_i}$ denote observations corresponding to the timepoints $t_{i,1}, \dots, t_{i,n_i}$ for each individual $i$. To model individual trajectories given the pairs $(t_{i,j}, y_{i,j})$, practitioners map observations into a low-dimensional space which represents progression patterns. Ideally, a small distance between individuals in the latent space reflects similar progression patterns.
In this section, we discuss state-of-the-art approaches to estimating this low-dimensional latent embedding. We classify them into three categories: the direct approach, mixed-effect models, and low-rank approximations.
2.1 Direct approach
If the set of observed values for each individual is dense, elementary interpolation using a continuous basis can be sufficient for approximating the entire trajectory. Let $\{b_k\}_{k=1}^{\infty}$ be a basis of $L^2([0, T])$. In practice, we truncate the basis to a set of the first $K$ basis elements. Let $b(t) = (b_1(t), \dots, b_K(t))$ be a vector of $K$ basis elements evaluated at a timepoint $t$. Throughout this article we use the word basis to refer to some truncated basis of $K$ elements. To find an individual trajectory, for each subject $i$, we may use the least squares method and estimate the set of coefficients $w_i \in \mathbb{R}^K$ minimizing the squared Euclidean distance to the observed points

(2.1) $\hat{w}_i = \operatorname*{arg\,min}_{w_i \in \mathbb{R}^K} \sum_{j=1}^{n_i} \left( y_{i,j} - b(t_{i,j})^\top w_i \right)^2.$
Finally, to reduce overfitting, it is often convenient to compute principal functional components across all individuals and represent curves in the space spanned by the first few of them. Such a representation, referred to as a Karhunen-Loève expansion (Watanabe, 1965; Kosambi, 2016), has become a foundation of many functional data analysis workflows (Ramsay and Dalzell, 1991; Yao et al., 2005; Cnaan et al., 1997; Laird, 1988; Horváth and Kokoszka, 2012; Besse et al., 1997).
This approach to modeling covariance has two main drawbacks. First, if the number of observations for an individual is smaller than or equal to the size of the basis $K$, we can fit a curve with no error, leading to overfitting and an unreliable estimator of the variance. Second, this approach ignores similarities between the curves, which could potentially improve the fit.
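As a concrete illustration of the direct approach, the sketch below fits per-subject least-squares coefficients in a truncated basis, as in (2.1). A polynomial basis stands in for the splines typically used in practice, and all names and values are illustrative.

```python
import numpy as np

def basis(t, K):
    # Evaluate a truncated polynomial basis b(t) = (1, t, ..., t^(K-1)).
    # A spline basis would be used in practice; polynomials keep the sketch short.
    return np.vstack([t**k for k in range(K)]).T  # shape (len(t), K)

def fit_subject(t_obs, y_obs, K):
    # Least-squares estimate of the K basis coefficients w_i for one subject,
    # minimizing the squared distance to the observed points.
    B = basis(t_obs, K)
    w, *_ = np.linalg.lstsq(B, y_obs, rcond=None)
    return w

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 1, 12))       # irregular timepoints for one subject
w_true = np.array([0.5, -1.0, 2.0])      # true coefficients, K = 3
y = basis(t, 3) @ w_true + 0.01 * rng.normal(size=12)
w_hat = fit_subject(t, y, K=3)
# With 12 observations and K = 3, the problem is well-posed and the fit is close.
assert np.allclose(w_hat, w_true, atol=0.2)
```

With fewer than $K$ observations the same call would interpolate the data exactly, which is precisely the overfitting issue described above.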
A basic idea to remedy these issues is to estimate both the basis coefficients and the variance structure simultaneously. Linear mixedeffect models provide a convenient solution to this problem.
2.2 Linear mixedeffect models
A common approach to modeling longitudinal observations is to assume that data come from a linear mixed-effect model (LMM) (Verbeke, 1997; Zeger et al., 1988). We operate in a functional space with a basis of $K$ elements and we assume there exists a fixed effect $\mu \in \mathbb{R}^K$, common to all individuals. We model the individual random effect as a vector of basis coefficients $w_i \in \mathbb{R}^K$. In the simplest form, we assume

(2.2) $w_i \sim \mathcal{N}(0, \Sigma),$

where $\Sigma$ is a $K \times K$ covariance matrix. We model individual observations as

(2.3) $y_i = B_i(\mu + w_i) + \epsilon_i,$

where $\epsilon_i \sim \mathcal{N}(0, \sigma^2 I)$, $\sigma$ is the standard deviation of the observation error, and $B_i = (b(t_{i,1}), \dots, b(t_{i,n_i}))^\top$ is the basis evaluated at the timepoints of individual $i$. Estimation of the model parameters is typically accomplished by the expectation-maximization (EM) algorithm (Laird and Ware, 1982). For estimating coefficients one can use the best linear unbiased predictor (BLUP) (Henderson, 1950; Robinson, 1991). Since the LMM estimates the covariance structure and the individual fit simultaneously, it reduces the problem of uncertain estimates of $w_i$ present in the direct approach. However, this model is only applicable if we observe a relatively large number of observations per subject, since we attempt to estimate $K$ coefficients for every subject.
To model trajectories from a small number of observations, practitioners further constrain the covariance structure. If we knew the functions which contribute the most to the random effect, we could fit an LMM in the smaller space spanned by these functions. We explore possibilities to learn the basis from the data using low-rank approximations of the covariance matrix.
2.3 Lowrank approximations
There are multiple ways to constrain the covariance structure. We can use cross-validation or information criteria to choose the best basis, the number of elements, or the positions of spline knots (Rice and Wu, 2001; Bigelow and Dunson, 2009). Alternatively, we can place a prior on the covariance matrix (MacLehose and Dunson, 2009).
Another solution is to restrict the latent space to $q < K$ dimensions and learn from the data the mapping between the latent space and the basis. In the simplest scenario with Gaussian observation noise, observations can be modeled as

(2.4) $y_i = B_i(\mu + A q_i) + \epsilon_i, \quad q_i \sim \mathcal{N}(0, I_q),$

where $A$ is a $K \times q$ matrix mapping the latent variables to basis coefficients,
following the notation from (2.3).
James et al. (2000) propose an EM algorithm for finding the model parameters and latent variables in (2.4). In the expectation stage, they marginalize the latent variables $q_i$, while in the maximization stage, with $q_i$ assumed observed, they maximize the likelihood with respect to the remaining parameters.
Another approach to estimating the parameters of (2.4) is to optimize over the latent variables and marginalize the mapping (Lawrence, 2004). This approach allows modifying the distance measure in the latent space using the kernel trick (Schulam and Arora, 2016).
Estimation of the covariance structure of processes is central to the estimation of individual trajectories. Descary and Panaretos (2016) propose a method where the estimate of the covariance matrix is obtained through matrix completion.
Methods based on low-rank approximations are widely adopted and applied in practice (Berkey and Kent, 1983; Yan et al., 2017; Hall et al., 2006; Besse and Ramsay, 1986; Yao and Lee, 2006; Greven et al., 2011). However, due to their probabilistic formulation and reliance on distributional assumptions, these models usually need to be carefully fine-tuned for specific situations. This shortcoming motivates us to develop an elementary optimization framework.
3 Modeling sparse longitudinal processes using matrix completion
The mixed-effect model, like any other probabilistic model, can be heavily biased when data come from a distribution considerably different than assumed. In the medical context, since biomarkers can differ in every clinical setting, fine-tuning the models may require an extensive amount of time and expertise. In this work, we develop a more flexible approach based solely on approximation rather than on assumed probability distributions.
We pose the problem of trajectory prediction as a matrix completion problem, and we solve it using sparse matrix factorization techniques (Rennie and Srebro, 2005; Candès and Recht, 2009). In the classical matrix completion problem, the objective is to predict elements of a sparsely observed matrix using its known elements while minimizing a specific criterion, often chosen to be the mean squared error (MSE). The motivating example is the "Netflix Prize" competition (Bennett and Lanning, 2007), where participants were tasked with predicting unknown movie ratings using other observed ratings. We can represent these data as a matrix of users and movies, with a subset of known elements measured on a fixed scale, e.g., from 1 to 5.
To solve the matrix completion problem, we usually assume that the true matrix can be approximated by a low-rank matrix $UV^\top$ (Srebro et al., 2005). In this low-rank representation, the columns of $V$, spanning the space of movies, can be interpreted as "genre" components, and each user is represented as a weighted sum of their preferred genres, i.e., a row in the matrix of latent variables $U$ (see Figure 2).
We can use the same idea to predict sparsely sampled curves, as long as we introduce an additional smoothing step. The low-dimensional latent structure now corresponds to progression patterns, and a trajectory of each individual can be represented as a weighted sum of these "principal" patterns. In Figure 2, the patterns are given by $V$, while the individual weights are encoded in $U$.
We first introduce a methodology for univariate sparsely sampled processes. The direct method, mixed-effect models, and low-rank approximations described in Section 2 have their analogy in the matrix completion setting. We discuss these analogies in Sections 3.2 and 3.3. Next, we show that the simple representation of the problem allows for extension to multivariate sparsely sampled processes and a regression setting.
3.1 Notation
For each individual $i$ we observe $n_i$ measurements at timepoints $t_{i,1}, \dots, t_{i,n_i}$. Unlike in the prior work introduced in Section 2, here we discretize the time interval to a grid of $G$ timepoints $g_1 < g_2 < \dots < g_G$, where $g_1 = 0$, $g_G = T$, and $G$ is arbitrarily large. Each individual is expressed as a partially observed vector of length $G$. For each timepoint $t_{i,j}$, for $1 \le j \le n_i$, we find a corresponding gridpoint $g_{k(i,j)}$. We assign $Y_{i,k(i,j)} = y_{i,j}$. Let $O_i$ be the set of grid indices corresponding to observed gridpoints for an individual $i$. All elements outside $O_i$ are considered missing.
For $G$ sufficiently large, our observations can be uniquely represented as an $N \times G$ matrix $Y$ with missing values. We denote the set of all observed elements by pairs of indices as $\Omega$. Let $P_\Omega$ be the projection onto the observed indices, i.e. $P_\Omega(Y) = Z$ such that $Z_{i,j} = Y_{i,j}$ for $(i,j) \in \Omega$ and $Z_{i,j} = 0$ otherwise. We define $P_\Omega^\perp$ as the projection on the complement of $\Omega$, i.e. $P_\Omega^\perp(Y) = Y - P_\Omega(Y)$. We use $\|\cdot\|_F$ to denote the Frobenius norm, i.e. the square root of the sum of squared matrix elements, and $\|\cdot\|_*$ to denote the nuclear norm, i.e. the sum of singular values.
As in Section 2, we impose smoothness by using a continuous basis $b$. However, now the basis is evaluated on the grid, and we define a $G \times K$ matrix $B$ with rows $b(g_k)^\top$ for $1 \le k \le G$.
In our algorithms we use diagonal thresholding operators defined as follows. Let $D = \operatorname{diag}(d_1, \dots, d_r)$ be a diagonal matrix with $d_1 \ge d_2 \ge \dots \ge d_r$. We define soft-thresholding as

$S_\lambda(D) = \operatorname{diag}\left( (d_1 - \lambda)_+, \dots, (d_r - \lambda)_+ \right),$

where $(x)_+ = \max(x, 0)$, and hard-thresholding as

$H_q(D) = \operatorname{diag}\left( d_1, \dots, d_q, 0, \dots, 0 \right),$

where $q \le r$.
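The two thresholding operators can be sketched in a few lines of numpy; the vector of singular values is assumed sorted in decreasing order, as produced by standard SVD routines.

```python
import numpy as np

def soft_threshold(d, lam):
    # Soft-thresholding of singular values: (d_i - lambda)_+
    return np.maximum(d - lam, 0.0)

def hard_threshold(d, q):
    # Hard-thresholding: keep the q largest singular values, zero out the rest.
    # Assumes d is sorted in decreasing order.
    out = np.zeros_like(d)
    out[:q] = d[:q]
    return out

d = np.array([5.0, 3.0, 1.0, 0.5])
assert np.allclose(soft_threshold(d, 1.0), [4.0, 2.0, 0.0, 0.0])
assert np.allclose(hard_threshold(d, 2), [5.0, 3.0, 0.0, 0.0])
```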
3.2 Direct approach
The optimization problem (2.1) of the direct approach described in Section 2.1 can be approximated in the matrix completion setting. First, note that the bias introduced by the grid is negligible if the grid is sufficiently large. We have

(3.1) $\left| y_{i,j} - b(g_{k(i,j)})^\top w_i \right| \le \left| y_{i,j} - b(t_{i,j})^\top w_i \right| + \left| b(t_{i,j})^\top w_i - b(g_{k(i,j)})^\top w_i \right|,$

and by the continuity of the basis on a closed interval, the second element in (3.1) can be arbitrarily small for $G$ large enough. For simplicity of notation, in the remainder of this work we assume that all the points are observed on the grid of $G$ equidistributed points.
Now, we rewrite the optimization problem (2.1) as a matrix completion problem

(3.2) $\hat{W} = \operatorname*{arg\,min}_{W} \left\| P_\Omega\left( Y - W B^\top \right) \right\|_F^2,$

where optimization over $W$ means optimization over all coefficient vectors $w_i \in \mathbb{R}^K$, and $W$ is the $N \times K$ matrix composed of the vectors $w_i^\top$ stacked vertically.
The matrix formulation in equation (3.2) and the classic approach in Section 2.1 share multiple characteristics. In both cases, if the data are dense, we may find an accurate representation of the underlying process simply by fitting least-squares estimates of $w_i$ or $W$. Conversely, if the data are too sparse, the problem becomes ill-posed, and the least-squares estimates can overfit.
However, the representations (2.1) and (3.2) differ algebraically, and this difference constitutes the foundation for the method introduced in the remainder of this paper. The matrix representation enables us to use the matrix completion framework and, in particular, in Section 3.4 we introduce convex optimization algorithms for solving (3.2).
Note that some low-rank constraints on the random effect from the mixed-effect model introduced in Section 2.2 can be expressed in terms of constraints on $W$, and they can potentially be enforced without imposing probability distributions. In particular, in Section 3.3 we show that the linear mixed-effect model can be expressed by constraining $W$.

3.3 Low-rank matrix approximation
In the low-rank approach described in Section 2.3, we assume that individual trajectories can be represented in a low-dimensional space, i.e., we constrain the rank of $W$.
We might attempt to take the same route for solving (3.2). One difficulty comes from the fact that optimization with a rank constraint on $W$ turns the original least squares problem into a nonconvex one. Motivated by the matrix completion literature, we relax the rank constraint in (3.2) to a nuclear norm constraint and attempt to solve
(3.3) $\hat{W} = \operatorname*{arg\,min}_{W} \left\| P_\Omega\left( Y - W B^\top \right) \right\|_F^2 + \lambda \|W\|_*,$

for some parameter $\lambda > 0$.
Cai et al. (2010) propose a first-order singular value thresholding algorithm (SVT) for solving a general class of problems involving a nuclear norm penalty and a linear form of the objective, which includes (3.3). Their algorithm can handle large datasets and has strong guarantees on convergence, but it requires tuning the step size parameter, which can greatly influence performance. This limitation was addressed by Ma et al. (2011); Mazumder et al. (2010); Hastie et al. (2015), who introduced iterative procedures which do not depend on such tuning. Moreover, Hardt and Wootters (2014); Chen and Wainwright (2015) propose methods for the nonconvex problem, and Ge et al. (2016) argue that the nonconvex problem has no spurious local minima.
In the last decade, the machine learning and statistical learning communities have introduced multiple algorithms for matrix completion. Many of them are suitable for solving (3.3). However, in this article we focus on analyzing the benefits of framing the trajectory prediction problem (2.1) in the matrix completion framework, rather than on benchmarking existing solutions. We argue that the matrix formulation (3.3) has two key advantages over the probabilistic approach (Section 2.3). First, the problem (3.3) does not impose any assumption on the distribution of $W$ or the distribution of the noise. This property is particularly important whenever the data do not follow the normal distribution. We further describe links between matrix factorization and low-rank mixed-effect models in Section 3.6. Second, the simple formulation of (3.3) allows us to generalize this model to the multivariate or regression setting, as we discuss in Section 4.

3.4 Low-rank approximation with singular value thresholding
The low-rank probabilistic approach, introduced in Section 2.3, is based on the observation that the underlying processes for each subject can be approximated in a low-dimensional space. Here, we exploit the same characteristic using matrix-factorization techniques for solving the optimization problem (3.3).
For the sake of clarity and simplicity, we choose to solve the problem (3.3) with an extended version of the SoftImpute algorithm designed by Hastie et al. (2015); Mazumder et al. (2010). As discussed in Section 3.3, many other convex optimization algorithms can be applied. The key component of the solution is the following property linking problem (3.3) with the singular value decomposition (SVD). Consider the fully observed case of (3.3), with $B$ orthonormal. Using Theorem 2.1 in Cai et al. (2010), one can show that the optimization problem

(3.4) $\hat{W} = \operatorname*{arg\,min}_{W} \frac{1}{2}\left\| Y - W B^\top \right\|_F^2 + \lambda \|W\|_*$

has a unique solution $\hat{W} = U S_\lambda(D) V^\top$, where $U D V^\top$ is the SVD of $YB$. We refer to $\hat{W}$ as the singular value thresholding (SVT) of $YB$.
In order to solve (3.4) with a sparsely observed $Y$, we modify the SoftImpute algorithm to account for the basis $B$. Our Algorithm 1 iteratively constructs better approximations of the global solution for each $\lambda$ in some predefined set. For a given approximation of the solution $\hat{W}$, we use $\hat{W} B^\top$ to impute the unknown elements of $Y$, obtaining $\tilde{Y} = P_\Omega(Y) + P_\Omega^\perp(\hat{W} B^\top)$. Then, we construct the next approximation by computing the SVT of $\tilde{Y} B$.
As a stopping criterion, we compute the change between subsequent solutions, relative to the magnitude of the solution, following the methodology in Cai et al. (2010). We set a fixed threshold for this criterion, depending on the desired accuracy.
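The impute-then-threshold iteration described above can be sketched as follows. This is a minimal numpy illustration, not the paper's reference implementation: it assumes a basis matrix with orthonormal columns, and all helper names are illustrative.

```python
import numpy as np

def svt(M, lam):
    # Singular value thresholding: U S_lambda(D) V^T for the SVD U D V^T of M.
    U, d, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(d - lam, 0.0)) @ Vt

def soft_longitudinal_impute(Y, mask, B, lam, n_iter=200, tol=1e-6):
    # Iteratively fill the missing entries of Y with the current fit W B^T,
    # then take the SVT of (imputed Y) B. Assumes B (G x K) has orthonormal columns.
    W = np.zeros((Y.shape[0], B.shape[1]))
    for _ in range(n_iter):
        fit = W @ B.T
        Y_imp = np.where(mask, Y, fit)   # observed entries kept, rest imputed
        W_new = svt(Y_imp @ B, lam)
        # Stop when the relative change between subsequent solutions is small.
        if np.linalg.norm(W_new - W) <= tol * max(np.linalg.norm(W), 1e-12):
            W = W_new
            break
        W = W_new
    return W

# Sanity check on fully observed, noiseless, low-rank data with lam = 0:
rng = np.random.default_rng(1)
G, K, N = 50, 4, 30
B, _ = np.linalg.qr(rng.normal(size=(G, K)))                # orthonormal basis
W_true = rng.normal(size=(N, 2)) @ rng.normal(size=(2, K))  # rank-2 coefficients
Y = W_true @ B.T
mask = np.ones_like(Y, dtype=bool)
W_hat = soft_longitudinal_impute(Y, mask, B, lam=0.0)
assert np.allclose(W_hat, W_true, atol=1e-6)
```

With missing entries and a positive threshold, the same loop trades exact interpolation for a low-rank, denoised fit.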
A common bottleneck of the algorithms introduced by Cai et al. (2010); Mazumder et al. (2010); Ma et al. (2011), as well as other SVT-based approaches, is the computation of the SVD of large matrices. This problem is particularly severe in standard matrix completion settings, such as the Netflix problem, where the matrix has hundreds of thousands of rows. However, in our problem, the SVD is computed on the matrix

(3.5) $\tilde{Y} B \in \mathbb{R}^{N \times K},$

with $K$ small in our motivating data example. While algorithms for large matrices are applicable here, the property (3.5) makes the computation of the SVD feasible in practice with generic packages such as PROPACK (Larsen, 2004).
For making predictions on new data, in practice, we are interested in two scenarios: (1) we collect new measurements of an existing patient; (2) we include a new patient with at least as many observations as the rank of the fitted $W$. In both cases, we use the current fit of each parameter and update the model with the newly observed data. This approach not only yields new predictions but also marginally improves the existing model.
3.5 $\ell_0$ norm regularization
While the nuclear norm relaxation (3.3) is motivated by making the problem convex, Mazumder et al. (2010) argue that in many cases it can also give better results than the rank constraint. They draw an analogy to the relation between best-subset regression ($\ell_0$ regularization) and LASSO ($\ell_1$ regularization, as in Tibshirani (1996); Friedman et al. (2001)). In LASSO, by shrinking model parameters we allow more parameters to be included, which can potentially improve the prediction if the true subset is larger than the one derived through the $\ell_0$ constraint. In the case of (3.3), shrinking the nuclear norm allows us to include more dimensions of $W$, again potentially improving the prediction if the true dimension is high.
Conversely, the same phenomenon can also lead to problems if the underlying dimension is small. In such a case, shrinking may allow unnecessary dimensions emerging from noise to enter the model. To remedy this issue, following the analogy with penalized linear regression, we may consider other classes of penalties. In particular, we may return to the rank constraint by replacing the nuclear norm penalty in (3.3) with the constraint $\operatorname{rank}(W) \le q$. The problem

(3.6) $\hat{W} = \operatorname*{arg\,min}_{\operatorname{rank}(W) \le q} \frac{1}{2}\left\| Y - W B^\top \right\|_F^2$

has a solution $\hat{W} = U H_q(D) V^\top$, where $U D V^\top$ is the SVD of $YB$ and $H_q$ is the hard-thresholding operator from Section 3.1. We refer to $\hat{W}$ as the hard singular value thresholding (hard SVT) of $YB$. We use Algorithm 2 to find the hard SVT solution in the partially observed case.
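The hard SVT amounts to a truncated SVD, which by the Eckart-Young theorem is the best rank-q approximation in Frobenius norm. A minimal numpy sketch:

```python
import numpy as np

def hard_svt(M, q):
    # Rank-q approximation of M: keep only the q largest singular values.
    U, d, Vt = np.linalg.svd(M, full_matrices=False)
    d[q:] = 0.0
    return U @ np.diag(d) @ Vt

rng = np.random.default_rng(2)
M = rng.normal(size=(20, 5))
M2 = hard_svt(M, 2)
assert np.linalg.matrix_rank(M2) == 2
```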
3.6 The link between the reduced-rank model and SoftLongitudinalImpute
Intuitively, we might expect similarity between the principal directions derived using the probabilistic approach (2.3) and their counterparts derived from the SVT-based approach. We investigate this relation by analyzing the behavior of SVT for matrices sampled from the probabilistic model given by (2.3).
For simplicity, let us assume that $\mu = 0$ and the data are fully observed on a grid of $G$ timepoints. Assume that observations come from the mixed-effect model

(3.7) $y_i = B w_i + \epsilon_i, \quad w_i \sim \mathcal{N}(0, \Sigma), \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2 I),$

where $\Sigma$ is an unknown covariance matrix of rank $q$ and the variables $w_i$ and $\epsilon_i$ are independent. By the spectral decomposition theorem we decompose $\Sigma = A \Lambda A^\top$, where $A$ is orthogonal and $\Lambda$ is a diagonal matrix with positive coefficients in decreasing order on the first $q$ positions. Since $w_i$ and $\epsilon_i$ are independent, the distribution (3.7) can be rewritten as

(3.8) $y_i = B A \Lambda^{1/2} q_i + \epsilon_i, \quad q_i \sim \mathcal{N}(0, I).$

Note that the model (3.8) is a factor model with $q$ factors, namely the first $q$ columns of $B A \Lambda^{1/2}$.
The following theorem constitutes a link between the mixed-effect model and SVT. It is adapted from Theorem 9.4.1 in Mardia et al. (1980), derived from Jöreskog (1967).

Theorem 1.

Let $Y$ be the fully observed matrix, let $\sigma$ be known, and let $U D V^\top$ be the SVD of $YB$. Then the singular value thresholding of $YB$, with the threshold determined by $\sigma$, is the maximum likelihood estimator of $W$.
Factor analysis methods give not only estimates of $A$ and $\Lambda$ but also estimates of the individual latent variables $q_i$. In the multivariate analysis literature, there are multiple ways to estimate factor scores, i.e., a matrix estimating $Q = (q_1, \dots, q_N)^\top$, most notably Spearman's scores, Bartlett's scores, and Thompson's scores (Kim and Mueller, 1978). Simply taking the SVT-based estimate of $W$ as the estimate of the scores corresponds to the solution of (3.4). Note that in Theorem 1 we assume that $\sigma$ is known, which is infeasible in practice. However, the likelihood can be parametrized by $\sigma$, and we can find the optimal solution analytically. This corresponds to minimizing (3.4) for different values of $\lambda$.
This intuition is confirmed in our simulation study in Section 5 (see Figure 3). Note that a similar analogy is drawn between the classical principal component analysis and probabilistic principal component analysis by Tipping and Bishop (1999) and James et al. (2000).

4 Multivariate longitudinal data
In practice, we are often interested in the prediction of a univariate process in the context of other longitudinal measurements and covariates constant over time. Examples include prediction of disease progression given a patient's demographics, data from clinical visits at which multiple blood tests or physical exams are taken, or measurements which are intrinsically multivariate, such as gene expression or X-rays collected over time. The growing interest in this setting stimulated research in latent models (Sammel and Ryan, 1996) and multivariate longitudinal regression (Gray and Brookmeyer, 1998, 2000). Diggle et al. (2002)
present an example case study in which longitudinal multivariate modeling enables estimation of joint confidence region of multiple parameters changing over time, shrinking the individual confidence intervals.
In this section, we present an extension of our univariate framework to multivariate measurements sparse in time. We explore two cases: (1) dimensionality reduction, where we project sparsely sampled multivariate processes to a small latent space, and (2) linear regression, where we use a multivariate process and covariates constant over time for prediction of a univariate process of interest. To motivate our approach, we start with a regression involving two variables observed over time.
4.1 Motivation: Univariate regression
Suppose, as previously, that the true processes lie in a low-dimensional space of some continuous functions (e.g., splines) and that we observe them with noise. More precisely, let

(4.1) $x_i = B w_i^x + \epsilon_i^x, \quad y_i = B w_i^y + \epsilon_i^y$

for $1 \le i \le N$, where $w_i^x$ and $w_i^y$ are $K$-dimensional coefficient vectors, $x_i$ and $y_i$ are $G$-dimensional vectors, and $B$ is a $G \times K$ matrix of splines evaluated on a grid of $G$ points. We assume zero-mean independent errors $\epsilon_i^x, \epsilon_i^y$ with fixed covariance matrices, and that the true processes underlying the observed $x_i$ and $y_i$ follow a linear relation in the spline space, i.e.

(4.2) $w_i^y = \beta^\top w_i^x,$

where $\beta$ is an unknown $K \times K$ matrix.
Let $X$ and $Y$ be $N \times G$ matrices formed from the vectors $x_i$ and $y_i$ stacked vertically. From (4.2) we have $W^y = W^x \beta$, while (4.1) implies

(4.3) $X = W^x B^\top + E^x, \quad Y = W^x \beta B^\top + E^y,$

where $E^x$ and $E^y$ are $N \times G$ matrices of observation noise. Without loss of generality we assume that $B$ is orthonormal. We have freedom to choose the basis, and any basis can be orthogonalized using, for example, the singular value decomposition.
Note that, due to the orthogonality of $B$, after multiplying both expressions in (4.3) by $B$ we can map the problem to the classical multivariate errors-in-variables models. Let

(4.4) $\tilde{X} = W + \tilde{E}^x, \quad \tilde{Y} = W \beta + \tilde{E}^y,$

where $\tilde{X} = XB$, $\tilde{Y} = YB$, $W = W^x$, $\tilde{E}^x = E^x B$ and $\tilde{E}^y = E^y B$. In errors-in-variables models it is assumed that the parameters $W$ and $\beta$ are unknown and are to be estimated. Both regressors and responses are measured with noise (here $\tilde{E}^x$ and $\tilde{E}^y$). The parameter $W$ can be interpreted as a latent representation of both $\tilde{X}$ and $\tilde{Y}$.
The problem of estimating the parameters in (4.4) has been extensively studied in literature dating back to Adcock (1878). Two main methods for estimating the parameters in (4.4) are maximum likelihood estimation (MLE) and generalized least squares estimation (GLSE). The MLE estimators are derived under the assumption of specific distributions of the errors. In GLSE, the only assumption about the errors is that they are independent, zero-mean, and have a common covariance matrix. Then, $\tilde{X} - W$ and $\tilde{Y} - W\beta$ are zero-mean, and estimates of $W$ and $\beta$ can be found by optimizing some norm of these expressions. Gleser and Watson (1973) show that in the no-intercept model, for $\tilde{X}$ and $\tilde{Y}$ of the same size (as in our case) and under the assumption of Gaussian errors, MLE and GLSE give the same estimates of $\beta$ if GLSE is derived for the Frobenius norm.
In this work, we focus on the GLSE approach, as it can be directly solved in our matrix factorization framework and we find it easier to deploy and extend in practice. To account for different magnitudes of the noise in $\tilde{X}$ and $\tilde{Y}$, we consider the optimization problem with weights

(4.5) $\min_{W, \beta} \left\| \tilde{X} - W \right\|_F^2 + \lambda \left\| \tilde{Y} - W \beta \right\|_F^2,$

where $\lambda > 0$. Let $Z = [\tilde{X}, \sqrt{\lambda}\, \tilde{Y}]$. Then (4.5) can be transformed to

(4.6) $\min_{W, \beta} \left\| Z - [W, \sqrt{\lambda}\, W \beta] \right\|_F^2,$

where the operator $[\cdot, \cdot]$ stacks side by side matrices with the same number of rows. To solve (4.6), we show that the SVD of $Z$ truncated to the first $K$ dimensions can be decomposed into $[W, \sqrt{\lambda}\, W\beta]$. Let $U D V^\top$ be the SVD of $Z$ truncated to the first $K$ singular values, with $V^\top = [V_1^\top, V_2^\top]$, where $V_1$ and $V_2$ are the $K \times K$ blocks of $V$. By Lemma 2.1 and Lemma 2.2 in Gleser (1981), the matrix $V_1$ is almost surely nonsingular. Therefore, $V_1^{-1}$ almost surely exists, and we can transform the decomposition such that $W = U D V_1^\top$ and $\sqrt{\lambda}\, \beta = (V_1^\top)^{-1} V_2^\top$, solving (4.6).
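The recovery of $W$ and $\beta$ from the truncated SVD of the stacked matrix can be illustrated on noiseless, fully observed data. This is a sketch under the assumptions above; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 40, 3
W = rng.normal(size=(N, K))      # latent representation
beta = rng.normal(size=(K, K))   # linear map in the spline space
X_t = W                          # noiseless tilde-X = W
Y_t = W @ beta                   # noiseless tilde-Y = W beta

lam = 1.0
Z = np.hstack([X_t, np.sqrt(lam) * Y_t])   # side-by-side stacking
U, d, Vt = np.linalg.svd(Z, full_matrices=False)
U, d, Vt = U[:, :K], d[:K], Vt[:K, :]      # truncate to the first K dimensions
V1t, V2t = Vt[:, :K], Vt[:, K:]            # K x K blocks of V^T

W_hat = U @ np.diag(d) @ V1t               # first block recovers W
beta_hat = np.linalg.solve(V1t, V2t) / np.sqrt(lam)

assert np.allclose(W_hat, X_t, atol=1e-8)
assert np.allclose(W_hat @ beta_hat, Y_t, atol=1e-8)
```

With noisy data the truncated SVD no longer reproduces $Z$ exactly, but the same decomposition yields the GLSE estimates.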
For partially observed data, if they are very sparse, it might be essential to constrain the rank of the solution. The partially observed and rank-constrained version of the problem (4.6) takes the form

$\min_{W, \beta} \left\| P_\Omega\left( Z - [W, \sqrt{\lambda}\, W\beta] \right) \right\|_F^2 \quad \text{subject to} \quad \operatorname{rank}(W) \le r,$

where $r$ is the desired rank of the solution and $P_\Omega$ is a projection on the observed indices. As previously, for an unknown rank we can introduce a rank penalty using the nuclear norm

(4.7) $\min_{W, \beta} \left\| P_\Omega\left( Z - [W, \sqrt{\lambda}\, W\beta] \right) \right\|_F^2 + \mu \left\| [W, \sqrt{\lambda}\, W\beta] \right\|_*.$
The algorithm in the general case of multiple processes is derived in Section 4.2.
Although we motivate the problem as a regression of $Y$ on $X$, note that $\tilde{X}$ and $\tilde{Y}$ are symmetric in (4.6). The low-rank matrix $W$ is, therefore, a joint low-rank representation of the matrices $\tilde{X}$ and $\tilde{Y}$, and thus our method can be seen as a dimensionality reduction technique or a latent space model. In Section 4.2 we extend this idea to a larger number of variables. In Section 4.3 we discuss how this approach can be used for regression.
The linear structure of (4.3) allows us to draw an analogy not only to the errors-in-variables models but also to the vast literature on canonical correlation analysis (CCA), partial least squares (PLS), factor analysis (FA), and linear functional equation (LFE) models. Borga et al. (1997) show that solutions of CCA and PLS can also be derived from the SVD of stacked matrices, as we did with $Z$ in (4.6). Gleser (1981) thoroughly discusses the relation between errors-in-variables, FA and LFE models.
Finally, note that the method of using the SVD of stacked matrices has also been directly applied in the recommender systems context. Condliff et al. (1999) showed that for improving the prediction of unknown entries in a partially observed matrix $Y$, one might consider a low-rank decomposition of the concatenated matrix $[Y, C]$, where $C$ contains additional covariates for each row, e.g., demographic features of individuals.
4.2 Dimensionality reduction
Suppose that for every individual we observe multiple variables varying in time (e.g. results of multiple medical tests at different times in a clinic) and we want to find a projection onto a low-dimensional latent space maximizing the variance explained. This projection would correspond to characterizing patients by their progression trends of multiple metrics simultaneously.
We extend the equation (4.6) to account for a larger number of covariates, and we do not impose a decomposition of the solution yet. We formulate the following optimization problem

$\min_{W} \left\| [Y_1, \dots, Y_p] - W (I_p \otimes B)^\top \right\|_F^2,$

where $Y_1, \dots, Y_p$ are $N \times G$ matrices corresponding to the $p$ processes measured on a grid of $G$ points, $B$ is a basis evaluated on the same grid and orthogonalized (a $G \times K$ matrix), and $W$ is an $N \times pK$ matrix.
Let $\tilde{B} = I_p \otimes B$ be the Kronecker product of the identity matrix $I_p$ and $B$, i.e. a matrix with $B$ stacked $p$ times on the diagonal, and let $Y = [Y_1, \dots, Y_p]$. Note that $\tilde{B}$ is orthogonal, and therefore the results developed in Section 3.4 apply here. In particular, singular value thresholding solves

$\min_{W} \frac{1}{2}\left\| Y - W \tilde{B}^\top \right\|_F^2 + \lambda \|W\|_*,$

and we can use Algorithm 1 for solving

(4.8) $\min_{W} \left\| P_\Omega\left( Y - W \tilde{B}^\top \right) \right\|_F^2 + \lambda \|W\|_*,$

where $P_\Omega$ is the projection on the observed indices $\Omega$.
Note that (4.8) can be further extended. First, we can allow a different basis for each process. Let $B_l$ be the basis selected for approximation of process $l$, for $1 \le l \le p$, where each basis is $K_l$-dimensional and evaluated on some time grid. In particular, we allow $K_l = 1$, which corresponds to a covariate constant in time for a given individual (e.g. gender). Second, if measurements are on different scales, we may normalize them using scaling factors $\gamma_1, \dots, \gamma_p$. Then, the problem (4.8) takes the form

$\min_{W} \sum_{l=1}^{p} \gamma_l \left\| P_{\Omega_l}\left( Y_l - W_l B_l^\top \right) \right\|_F^2 + \lambda \|W\|_*,$

which we solve with the same techniques as (4.8). Third, the observed indices may vary in each process, allowing for, for example, dense measurements of one process and sparse measurements of another.
4.3 Regression
In practice, we are interested in estimating progression of an individual parameter (e.g., cancer growth) given some individual features constant over time (e.g., demographics) or given progressions of other covariates (e.g., blood tests, vitals, biopsy results).
We start with a regression problem with fixed covariates and sparsely observed response trajectories. Let $X$ be an $N \times p$ matrix of observed covariates, let $Y$ be a sparsely observed $N \times G$ matrix of trajectories, and let $B$ be a $G \times K$ matrix representing a basis of splines evaluated on a grid of $G$ points. We consider the optimization problem

(4.9) $\min_{\beta} \left\| P_\Omega\left( Y - X \beta B^\top \right) \right\|_F^2,$

where $\beta$ is a $p \times K$ matrix and $P_\Omega$ is a projection on the observed indices $\Omega$. To solve (4.9) we propose the iterative Algorithm 3.
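One way to realize such an iterative scheme is to alternate imputation of the missing entries with a closed-form least-squares update of the coefficient matrix. The following numpy sketch assumes an orthonormal basis and is only an illustration of the idea, not the paper's Algorithm 3; all names are illustrative.

```python
import numpy as np

def regress_trajectories(X, Y, mask, B, n_iter=100, tol=1e-8):
    # Alternating scheme for min_beta || P_Omega(Y - X beta B^T) ||_F^2:
    # impute missing entries of Y with the current fit, then solve the
    # fully observed least-squares problem for beta in closed form.
    # Assumes B (G x K) has orthonormal columns.
    p, K = X.shape[1], B.shape[1]
    beta = np.zeros((p, K))
    XtX_inv = np.linalg.pinv(X.T @ X)
    for _ in range(n_iter):
        fit = X @ beta @ B.T
        Y_imp = np.where(mask, Y, fit)          # observed entries kept, rest imputed
        beta_new = XtX_inv @ X.T @ Y_imp @ B    # closed-form update (B orthonormal)
        if np.linalg.norm(beta_new - beta) <= tol * max(np.linalg.norm(beta), 1e-12):
            beta = beta_new
            break
        beta = beta_new
    return beta

# Sanity check on fully observed, noiseless data:
rng = np.random.default_rng(4)
N, p, G, K = 60, 3, 40, 5
B, _ = np.linalg.qr(rng.normal(size=(G, K)))
X = rng.normal(size=(N, p))
beta_true = rng.normal(size=(p, K))
Y = X @ beta_true @ B.T
mask = np.ones_like(Y, dtype=bool)
beta_hat = regress_trajectories(X, Y, mask, B)
assert np.allclose(beta_hat, beta_true, atol=1e-6)
```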
Suppose we want to incorporate other variables varying in time for prediction of the response process. We can directly apply the method proposed in Section 4.2 and model the response and the regressors together. However, it might be suboptimal for prediction, as it optimizes the least-squares distance in all variables rather than only the response. This difference is analogous to the difference between regressing a univariate response on independent variables and using the first principal component of the joint data to predict the response. While the former minimizes the distance to the response, the latter minimizes the distance to all variables jointly, which is usually less efficient for predicting the response.
Alternatively, we can use the methodology from Section 4.2 only for the covariates. The solution of (4.8) can be decomposed into , and we regress on as in (4.9). Let be an orthogonal matrix derived from , where is the number of latent components. We search for a matrix solving
(4.10) 
where is a projection on the set of observed coefficients. We propose a two-step iterative procedure (Algorithm 4).
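The two-step idea can be sketched as follows: first complete the covariate processes with a low-rank fit, then regress the response coefficients on the leading latent components of the completed covariates. This is a hedged illustration, not Algorithm 4 itself; for simplicity it assumes the response is fully observed on the grid, and all names are hypothetical.

```python
import numpy as np

def latent_then_regress(C, mask_c, Y, B, lam, r):
    """Illustrative two-step procedure:
    (1) complete the sparse covariate processes C via soft-thresholded SVD,
    (2) regress response coefficients on the leading r latent components.
    C: (n, T) sparse covariates; mask_c: boolean (n, T); Y: (n, T) response;
    B: (T, K) basis with orthonormal columns."""
    # step 1: soft-impute completion of the covariates in the basis
    W = np.zeros((C.shape[0], B.shape[1]))
    for _ in range(200):
        Z = np.where(mask_c, C, W @ B.T)
        U, s, Vt = np.linalg.svd(Z @ B, full_matrices=False)
        W = (U * np.maximum(s - lam, 0.0)) @ Vt
    # step 2: latent scores = leading left singular vectors of the completed W
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    scores = U[:, :r]
    # regress the response coefficients (projection onto the basis) on the scores
    Yb = Y @ B
    beta, *_ = np.linalg.lstsq(scores, Yb, rcond=None)
    return scores, beta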
5 Simulations
We illustrate properties of the multivariate longitudinal fitting in a simulation study. First, we generate curves with quickly decaying eigenvalues of covariance matrices. Then, we compare the performance of the methods in terms of the variance explained by the predictions.
5.1 Data generation
Let be a grid of equidistributed points and let be a basis of spline functions evaluated on the grid . We simulate three matrices using the same procedure , where :

1. Define the procedure generating symmetric matrices for a given vector :
   (a) Simulate a matrix .
   (b) Use SVD to decompose it to .
   (c) Return , where is a diagonal matrix with on the diagonal.
2. Let , and , where denotes the multivariate Gaussian distribution.
3. Draw vectors according to the distribution .
4. Return a matrix with rows .
Define and , and let be generated following the procedure, with . We consider and as coefficients in a spline space . We derive the corresponding functions by multiplying these matrices by , i.e. , and . We set .
We uniformly sample indices , i.e. around points per curve on average. Each observed element of each matrix and is drawn with Gaussian noise with mean and standard deviation . The task is to recover from sparse observed elements .
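The data-generating scheme above can be sketched in a few lines of numpy: smooth curves built from an orthonormal basis with quickly decaying covariance eigenvalues, observed at only a few noisy time points per curve. All constants below are illustrative, not the paper's exact values, and a random orthonormal matrix stands in for the evaluated spline basis.

```python
import numpy as np

def random_orthobasis(T, K, rng):
    """Orthonormal columns standing in for the evaluated spline basis."""
    Q, _ = np.linalg.qr(rng.normal(size=(T, K)))
    return Q

def simulate_sparse_curves(n=100, T=51, K=7, d=0.7, n_obs=8, sigma=0.1, seed=0):
    """Sketch of the Section 5.1 setup: curves with quickly decaying
    covariance eigenvalues, sparsely observed with Gaussian noise."""
    rng = np.random.default_rng(seed)
    B = random_orthobasis(T, K, rng)
    eig = d ** np.arange(K)                      # quickly decaying eigenvalues
    scores = rng.normal(size=(n, K)) * np.sqrt(eig)
    curves = scores @ B.T                        # true underlying curves
    mask = np.zeros((n, T), dtype=bool)
    for i in range(n):                           # n_obs observed points per curve
        mask[i, rng.choice(T, size=n_obs, replace=False)] = True
    noisy = np.where(mask, curves + sigma * rng.normal(size=(n, T)), 0.0)
    return curves, noisy, mask, B
```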
5.2 Methods
We compare SoftLongitudinalImpute (SLI) defined in Algorithm 1 with the fPCA procedure of James et al. (2000), as implemented by Peng and Paul (2009). Both SLI and fPCA require a set of basis functions. In both cases, we use the same basis as in the data generation process. In SLI we also need to specify the tuning parameter , while in fPCA we need to choose the rank . We use cross-validation to choose and by optimizing the prediction error on held-out (validation) observations.
We divide the observed coefficients into training (), validation () and test () sets. We choose the best parameters of the three models on the validation set and then retrain on the training and validation sets combined. We compute the error of entire curves as the mean squared Frobenius distance between and the estimated , i.e.
(5.1) 
on the test set .
For illustration, we also present results of SparseLongitudinalRegression (SLR), regressing on and . Note, however, that SLR, unlike the two other methods, uses additional information about contained in and . This comparison is only meant to validate our approach.
We train the algorithms with all combinations of parameters: the regularization parameter for the SLI and SLR procedures and the rank for the fPCA procedure. We define the grid of points. We compare the three methods, fPCA, SLI and SLR, to the baseline null model, which we define as the population mean across all visits.
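The parameter selection by validation error can be sketched generically: fit the model on the training entries for each value on the grid and keep the value minimizing the held-out squared error. The `fit` callable below is a hypothetical stand-in for any of the completion procedures.

```python
import numpy as np

def select_lambda(fit, Y, train_mask, val_mask, lambdas):
    """Grid search over the tuning parameter by validation error.
    `fit(Y, mask, lam)` is assumed to return a completed matrix of Y's shape."""
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        Y_hat = fit(Y, train_mask, lam)
        err = np.mean((Y - Y_hat)[val_mask] ** 2)   # held-out squared error
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```

The same loop applies to choosing the rank for fPCA, with the rank grid in place of the regularization grid.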
5.3 Results
The SLI achieves performance similar to fPCA (James et al., 2000), as presented in Table 1. The SLR, having access to additional information about , clearly outperforms the other methods, validating its correctness.
Method  MSE    std
fPCA    0.124  0.03
SLI     0.121  0.03
SLR     0.064  0.03
In Figure 3 we present the first components derived from both sparse functional PCA and SLI. In Figure 4 we present example predictions from all four methods. In Figure 5, we present the estimated rank and cross-validation error for one of the simulation runs.
6 Data study
The target class of problems motivating this study is the prediction of individual patients' disease progression trajectories from sparse observations. In this section, we present how our methods can be applied to understand the progression of neurological disorders leading to motor impairments and gait pathologies. First, we discuss how practitioners collect the data and use them to guide the decision process. Next, we describe our dataset and present how our methodology can improve current workflows. Then, we present and discuss the results.
In clinical gait analysis, at each visit the movement of a child is recorded using optical motion capture. Optical motion capture estimates 3D positions of body parts using a set of cameras tracking marker positions on the subject's body. A set of at least three markers is placed on each analyzed body segment, usually associated with bones (e.g., tibia, humerus), so that its 3D position and orientation can be identified uniquely. These data are then used to determine relative positions of body segments by computing the angle between the corresponding planes. Typically this is done using a biomechanical model, which enforces biomechanical constraints and improves accuracy.
In gait analysis, practitioners are usually concerned with the movement patterns of nine joints in the lower limbs: the ankle, knee, and hip in each leg, and the pelvis (Figure 7). Each joint angle is measured over time. To make the curves comparable between patients, the time dimension is usually normalized to the percentage of the gait cycle, defined as the time between two foot strikes (Figure 7).
While trajectories of joint angles are commonly used by practitioners to make treatment decisions, their high-dimensional nature hinders their use as a quantitative metric of gait pathology or treatment outcome. This motivates the development of univariate summary metrics of gait impairment, such as questionnaire-based metrics like the Gillette Functional Assessment Walking Scale (FAQ) (Gorton III et al., 2011), the Gross Motor Function Classification System (GMFCS) (Palisano et al., 2008) and the Functional Mobility Scale (FMS) (Graham et al., 2004), or observational video analysis scores such as the Edinburgh Gait Score (Read et al., 2003).
One of the most widely adopted quantitative measurements of gait impairment in pediatrics is the Gait Deviation Index (GDI) (Schwartz and Rozumalski, 2008). GDI is derived from joint angle trajectories and measures the deviation of the first ten singular values from the average of the typically developing population. GDI is normalized so that 100 corresponds to the mean value of typically developing children, with a standard deviation equal to 10. It has been shown to be highly correlated with questionnaire-based methods. Thanks to its deterministic derivation from motion capture measurements, it is considered more objective than questionnaires.
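The construction of such a deviation index can be illustrated schematically: project gait curves onto the leading singular vectors of a typically developing (TD) reference population, measure each subject's distance from the TD mean in that feature space, and rescale so that the TD group has mean 100 and standard deviation 10. This is an illustrative sketch, not the published GDI formula.

```python
import numpy as np

def deviation_index(curves, td_curves, k=10):
    """GDI-style index (illustrative, not the published formula):
    distance from the TD mean in the space of the leading k singular
    vectors, rescaled so the TD group has mean 100 and sd 10."""
    _, _, Vt = np.linalg.svd(td_curves - td_curves.mean(0), full_matrices=False)
    V = Vt[:k].T                               # leading k right singular vectors
    feats = curves @ V
    td_feats = td_curves @ V
    dist = np.linalg.norm(feats - td_feats.mean(0), axis=1)
    td_dist = np.linalg.norm(td_feats - td_feats.mean(0), axis=1)
    z = (dist - td_dist.mean()) / td_dist.std()
    return 100.0 - 10.0 * z                    # larger deviation -> lower index
```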
In medical practice, GDI has been adopted as a metric for diagnosing the severity of impairment, and it constitutes an integral part of the clinical decision-making process and the evaluation of treatment outcomes. However, in order to correctly identify the surgery outcome, it is crucial to understand the natural progression of GDI. In particular, a positive outcome of a surgery might be negligible when compared to natural improvement during puberty. Similarly, a surgery maintaining the same level of GDI might be incorrectly classified as a failure, if the decline in the patient's function over time is not accounted for.
Methods introduced in this article can be used to approximate individual progressions of GDI. First, we present how a prediction can be made solely based on the patient’s GDI history and histories of other patients. Next, using our regression procedure, we predict GDI trajectories using other sparsely observed covariates, namely O2 expenditure and walking speed.
6.1 Materials and methods
We analyze a dataset of Gillette Children's Hospital patients visiting the clinic between 1994 and 2014, with ages ranging between 4 and 19 years, mostly diagnosed with Cerebral Palsy. The dataset contains visits of patients without gait disorders and visits of patients with gait pathologies.
Motion capture data was collected at 120 Hz and joint angles in time were extracted. These joint angles were then normalized in time to the gait cycle, resulting in curves as in Figure 7. Points from these curves were then subsampled (51 equidistributed points). Given the data in this form, we computed the GDI metric for each visit and each leg.
In the dataset which we received from the hospital, for each patient we observe their birth date and disease subtype. From each visit, we observe the following variables: patient ID, time of the visit, GDI of the left leg, GDI of the right leg, walking speed, and O2 expenditure. Other clinical variables that we received were not included in this study. Note that walking speed is related to the information we lose during normalization of the gait cycle in time. O2 expenditure is a measure of a subject's energy expenditure during walking. Pathological gait is often energy inefficient, and reduction of O2 expenditure is one of the objectives of treatment. Finally, GDI is computed for both legs, while in many cases the neurological disorder affects only one limb. To simplify the analysis, we focus on the more impaired limb by analyzing the minimum of the left and the right GDI.
Our objective is to model individual progression curves. We test three methods: functional principal components (fPCA), SoftLongitudinalImpute (SLI) and SparseLongitudinalRegression (SLR). We compare the results to the null model – the population mean across all visits (mean). In SLR, we approximate GDI using latent variables of the sparsely observed covariates O2 expenditure and walking speed, following the methodology from Section 4.3.
In our evaluation procedure, for the test set, we randomly select of observations of patients who visited the clinic at least times. Then, we split the remaining of observations into training and validation sets in proportion. We train the algorithms with the following combinations of parameters: the regularization parameter for the SLI and SLR procedures and the rank for the fPCA procedure. We define the grid of points. We repeat the entire evaluation procedure times.
Let us denote the test set as . We validate each model on held-out indices by computing the mean squared error as defined in (5.1). We select the parameters of each of the three methods using cross-validation, with the same validation set.
For reproducibility and to simplify adoption, we created an R package fcomplete, available at https://github.com/kidzik/fcomplete. The package contains implementations of Algorithms 1, 2, 3, and 4, along with helper functions for transforming the data, sampling training and test datasets, and plotting. For convenience, we also provide an interface to the fpca package implementing sparse functional principal components algorithms (James et al., 2000; Peng and Paul, 2009). The analysis was performed on a desktop PC with 32 GB of RAM and an Intel® Core™ i7-6700K CPU @ 4.00GHz, running Ubuntu 18.04 and R version 3.4.4.
Method  mean  sd
fPCA    0.73  0.16
SLI     0.70  0.10
SLR     0.68  0.08
6.2 Results
Compared to the null model, all three methods explain around of the variance. We present detailed results in Table 2. SparseLongitudinalRegression had a smaller mean error and variance than the two other methods. We conclude that O2 expenditure and walking speed provide additional information for predicting GDI progression.
Both fPCA and SparseLongitudinalImpute provide latent representations of patients' progression curves which can potentially be interpreted. To this end, we first analyze the singular vectors from our SVD solution, which we refer to as principal components.
In the left plot in Figure 9 we show the first two estimated principal components. We found that the first component captures the change in GDI before and after the age of 20. The second component models changes around the age of 10 and around the age of 18. In the right plot in Figure 9, by adding a principal component to the population mean curve, we illustrate how differences in the first component are reflected in a patient's trajectory. By visual investigation of the curves returned by our SparseLongitudinalImpute and by fPCA, we found similar trends in the first two components.
Since our SVD decomposition defines a low-rank representation of the progression trends, we can also use it to gain insights into progression in different groups of patients. In Cerebral Palsy, paralysis is divided into subtypes depending on which limbs are affected: monoplegia (one limb), diplegia (two legs), hemiplegia (one side of the body), triplegia (three limbs), quadriplegia (four limbs). Hemiplegia is the most prevalent in our population, and it can be subdivided by severity, from type I (weak muscles, drop foot) to type IV (severe spasticity). We find differences between progression trends for different subtypes of paralysis. We illustrate these distributions in Figure 9.
7 Discussion
Results presented in Section 6.2 imply that our SparseLongitudinalImpute and SparseLongitudinalRegression methods can be successfully applied to understand trends in the variability of disease progression. We show how to incorporate progressions of O2 expenditure and walking speed in the prediction of the progression of GDI. We present how the low-rank representation can be leveraged to gain insights into subtypes of impairment.
While a large portion of the variance remains unexplained, it is important to note that individual progression is currently not explicitly accounted for in the decision-making process. Instead, practitioners only use population-level characteristics of the dependence between age and impairment severity. Our model can greatly improve this practice.
Despite the successful application, we identify limitations that could be addressed in extensions of our model. First, the method is meant to capture natural continuous progression of GDI, while in practice there are many discrete events, such as surgeries, that break the continuity assumption and render the mean trajectories less interpretable. Second, our methodology does not address the "cold start problem", i.e. we do not provide tools for predictions with only one or zero observations. Third, we do not provide explicit equations for confidence bounds of the predicted parameters.
While these and other limitations can constrain the applicability of the method in its current form, they can be addressed using existing matrix completion techniques. The focus of this paper is to introduce a computational framework rather than build a full solution for all cases. The elementary formulation of the optimization problem, as well as the fully functional R implementation, can foster the development of new tools using matrix completion for longitudinal analysis and for mixed-effect models.
References
 Adcock (1878) Robert James Adcock. A problem in least squares. The Analyst, 5(2):53–54, 1878.
 Bennett and Lanning (2007) James Bennett and Stan Lanning. Netflix: the netflix prize. In KDD Cup and Workshop in conjunction with KDD, 2007.
 Berkey and Kent (1983) CS Berkey and RL Kent. Longitudinal principal components and nonlinear regression models of early childhood growth. Annals of human biology, 10(6):523–536, 1983.
 Besse and Ramsay (1986) Philippe Besse and James O Ramsay. Principal components analysis of sampled functions. Psychometrika, 51(2):285–311, 1986.
 Besse et al. (1997) Philippe C Besse, Hervé Cardot, and Frédéric Ferraty. Simultaneous nonparametric regressions of unbalanced longitudinal data. Computational Statistics & Data Analysis, 24(3):255–270, 1997.
 Bigelow and Dunson (2009) Jamie L Bigelow and David B Dunson. Bayesian semiparametric joint models for functional predictors. Journal of the American Statistical Association, 104(485):26–36, 2009.
 Borga et al. (1997) Magnus Borga, Tomas Landelius, and Hans Knutsson. A unified approach to pca, pls, mlr and cca. Linköping University, Department of Electrical Engineering, 1997.
 Cai et al. (2010) JianFeng Cai, Emmanuel J Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
 Candès and Recht (2009) Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717, 2009.
 Chen and Wainwright (2015) Yudong Chen and Martin J Wainwright. Fast lowrank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025, 2015.
 Cnaan et al. (1997) Avital Cnaan, NM Laird, and Peter Slasor. Tutorial in biostatistics: using the general linear mixed model to analyse unbalanced repeated measures and longitudinal data. Stat Med, 16(2349):80, 1997.
 Condli et al. (1999) Michelle Keim Condli, David D Lewis, David Madigan, and Christian Posse. Bayesian mixedeffects models for recommender systems. In ACM SIGIR, volume 99, 1999.
 Descary and Panaretos (2016) MarieHélène Descary and Victor M Panaretos. Functional data analysis by matrix completion. arXiv preprint arXiv:1609.00834, 2016.
 Diggle et al. (2002) Peter Diggle, Patrick Heagerty, KungYee Liang, and Scott Zeger. Analysis of longitudinal data. Oxford University Press, 2002.
 Friedman et al. (2001) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.
 Ge et al. (2016) Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
 Gleser (1981) Leon Jay Gleser. Estimation in a multivariate "errors in variables" regression model: large sample results. The Annals of Statistics, pages 24–44, 1981.
 Gleser and Watson (1973) Leon Jay Gleser and Geoffrey S Watson. Estimation of a linear transformation. Biometrika, 60(3):525–534, 1973.
 Gorton III et al. (2011) George E Gorton III, Jean L Stout, Anita M Bagley, Katherine Bevans, Tom F Novacheck, and Carole A Tucker. Gillette functional assessment questionnaire 22-item skill set: factor and rasch analyses. Developmental Medicine & Child Neurology, 53(3):250–255, 2011.
 Graham et al. (2004) H Kerr Graham, Adrienne Harvey, Jillian Rodda, Gary R Nattrass, and Marinis Pirpiris. The functional mobility scale (fms). Journal of Pediatric Orthopaedics, 24(5):514–520, 2004.
 Gray and Brookmeyer (1998) Sarah M Gray and Ron Brookmeyer. Estimating a treatment effect from multidimensional longitudinal data. Biometrics, pages 976–988, 1998.
 Gray and Brookmeyer (2000) Sarah M Gray and Ron Brookmeyer. Multidimensional longitudinal data: estimating a treatment effect from continuous, discrete, or time-to-event response variables. Journal of the American Statistical Association, 95(450):396–406, 2000.
 Greven et al. (2011) Sonja Greven, Ciprian Crainiceanu, Brian Caffo, and Daniel Reich. Longitudinal functional principal component analysis. Recent Advances in Functional Data Analysis and Related Topics, pages 149–154, 2011.
 Hall et al. (2006) Peter Hall, HansGeorg Müller, and JaneLing Wang. Properties of principal component methods for functional and longitudinal data analysis. The annals of statistics, pages 1493–1517, 2006.
 Hardt and Wootters (2014) Moritz Hardt and Mary Wootters. Fast matrix completion without the condition number. In Conference on Learning Theory, pages 638–678, 2014.
 Hastie et al. (2015) Trevor Hastie, Rahul Mazumder, Jason D Lee, and Reza Zadeh. Matrix completion and lowrank svd via fast alternating least squares. Journal of Machine Learning Research, 16:3367–3402, 2015.
 Henderson (1950) Charles R Henderson. Estimation of genetic parameters. In Biometrics, volume 6, pages 186–187. International biometric soc, 1950.
 Horváth and Kokoszka (2012) Lajos Horváth and Piotr Kokoszka. Inference for functional data with applications, volume 200. Springer Science & Business Media, 2012.
 James et al. (2000) Gareth M James, Trevor J Hastie, and Catherine A Sugar. Principal component models for sparse functional data. Biometrika, pages 587–602, 2000.
 Jöreskog (1967) Karl G Jöreskog. Some contributions to maximum likelihood factor analysis. Psychometrika, 32(4):443–482, 1967.
 Kim and Mueller (1978) JaeOn Kim and Charles W Mueller. Factor analysis: Statistical methods and practical issues, volume 14. Sage, 1978.
 Kosambi (2016) DD Kosambi. Statistics in function space. In DD Kosambi, pages 115–123. Springer, 2016.
 Laird (1988) Nan M Laird. Missing data in longitudinal studies. Statistics in Medicine, 7(1-2):305–315, 1988.
 Laird and Ware (1982) Nan M Laird and James H Ware. Random-effects models for longitudinal data. Biometrics, pages 963–974, 1982.
 Larsen (2004) Rasmus Munk Larsen. PROPACK – software for large and sparse SVD calculations. Available online, URL http://sun.stanford.edu/~rmunk/PROPACK, 2004.
 Lawrence (2004) Neil D Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In Advances in Neural Information Processing Systems, pages 329–336, 2004.
 Liu and Huang (2009) Lei Liu and Xuelin Huang. Joint analysis of correlated repeated measures and recurrent events processes in the presence of death, with application to a study on acquired immune deficiency syndrome. Journal of the Royal Statistical Society: Series C (Applied Statistics), 58(1):65–81, 2009.
 Ma et al. (2011) Shiqian Ma, Donald Goldfarb, and Lifeng Chen. Fixed point and bregman iterative methods for matrix rank minimization. Mathematical Programming, 128(1):321–353, 2011.
 MacLehose and Dunson (2009) Richard F MacLehose and David B Dunson. Nonparametric bayes kernelbased priors for functional data analysis. Statistica Sinica, pages 611–629, 2009.
 Mardia et al. (1980) Kantilal V Mardia, John T Kent, and John M Bibby. Multivariate analysis (probability and mathematical statistics), 1980.
 Mazumder et al. (2010) Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of machine learning research, 11(Aug):2287–2322, 2010.
 McCulloch and Neuhaus (2001) Charles E McCulloch and John M Neuhaus. Generalized linear mixed models. Wiley Online Library, 2001.
 Palisano et al. (2008) Robert J Palisano, Peter Rosenbaum, Doreen Bartlett, and Michael H Livingston. Content validity of the expanded and revised gross motor function classification system. Developmental Medicine & Child Neurology, 50(10):744–750, 2008.
 Peng and Paul (2009) Jie Peng and Debashis Paul. A geometric approach to maximum likelihood estimation of the functional principal components from sparse longitudinal data. Journal of Computational and Graphical Statistics, 18(4):995–1015, 2009.
 Ramsay and Dalzell (1991) James O Ramsay and CJ Dalzell. Some tools for functional data analysis. Journal of the Royal Statistical Society. Series B (Methodological), pages 539–572, 1991.
 Read et al. (2003) Heather S Read, M Elizabeth Hazlewood, Susan J Hillman, Robin J Prescott, and James E Robb. Edinburgh visual gait score for use in cerebral palsy. Journal of pediatric orthopaedics, 23(3):296–301, 2003.
 Rennie and Srebro (2005) Jasson DM Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd international conference on Machine learning, pages 713–719. ACM, 2005.
 Rice and Wu (2001) John A Rice and Colin O Wu. Nonparametric mixed effects models for unequally sampled noisy curves. Biometrics, 57(1):253–259, 2001.
 Rizopoulos et al. (2014) Dimitris Rizopoulos, Laura A Hatfield, Bradley P Carlin, and Johanna JM Takkenberg. Combining dynamic predictions from joint models for longitudinal and timetoevent data using bayesian model averaging. Journal of the American Statistical Association, 109(508):1385–1397, 2014.
 Robinson (1991) George K Robinson. That blup is a good thing: the estimation of random effects. Statistical science, pages 15–32, 1991.
 Sammel and Ryan (1996) Mary Dupuis Sammel and Louise M Ryan. Latent variable models with fixed effects. Biometrics, pages 650–663, 1996.
 Schulam and Arora (2016) Peter Schulam and Raman Arora. Disease trajectory maps. In Advances in Neural Information Processing Systems, pages 4709–4717, 2016.
 Schwartz and Rozumalski (2008) Michael H Schwartz and Adam Rozumalski. The gait deviation index: a new comprehensive index of gait pathology. Gait & posture, 28(3):351–357, 2008.
 Song et al. (2002) Xiao Song, Marie Davidian, and Anastasios A Tsiatis. A semiparametric likelihood approach to joint modeling of longitudinal and timetoevent data. Biometrics, 58(4):742–753, 2002.
 Srebro et al. (2005) Nathan Srebro, Noga Alon, and Tommi S Jaakkola. Generalization error bounds for collaborative prediction with lowrank matrices. In Advances In Neural Information Processing Systems, pages 1321–1328, 2005.
 Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
 Tipping and Bishop (1999) Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.
 Verbeke (1997) Geert Verbeke. Linear mixed models for longitudinal data. In Linear mixed models in practice, pages 63–153. Springer, 1997.
 Watanabe (1965) Satosi Watanabe. Karhunenloeve expansion and factor analysis. In Transactions of the 4th Prague Conference on Information Theory, Statistical Decision Functions, and Random Processes, Prague, 1965, 1965.
 Yan et al. (2017) Fangrong Yan, Xiao Lin, Xuelin Huang, et al. Dynamic prediction of disease progression for leukemia patients by functional principal component analysis of longitudinal expression levels of an oncogene. The Annals of Applied Statistics, 11(3):1649–1670, 2017.
 Yao and Lee (2006) Fang Yao and Thomas Lee. Penalized spline models for functional principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):3–25, 2006.
 Yao et al. (2005) Fang Yao, Hans-Georg Müller, Jane-Ling Wang, et al. Functional linear regression analysis for longitudinal data. The Annals of Statistics, 33(6):2873–2903, 2005.
 Zeger et al. (1988) Scott L Zeger, Kung-Yee Liang, and Paul S Albert. Models for longitudinal data: a generalized estimating equation approach. Biometrics, pages 1049–1060, 1988.
Appendix A Proofs
We prove convergence by mapping our problem into the framework of Mazumder et al. (2010). Following their notation we define