# Mean and Covariance Estimation for Functional Snippets

We consider estimation of mean and covariance functions for functional snippets, which are short segments of functions, possibly observed irregularly, on individual-specific subintervals that are much shorter than the entire study interval. Estimation of the covariance function for functional snippets is challenging because information for the far off-diagonal regions of the covariance structure is completely missing. We address this difficulty by decomposing the covariance function into a variance function component and a correlation function component. The variance function can be effectively estimated nonparametrically, while the correlation part is modeled parametrically, possibly with an increasing number of parameters, to handle the missing information in the far off-diagonal regions. Both theoretical analysis and numerical simulations suggest that this hybrid divide-and-conquer strategy is effective. In addition, we propose a new estimator for the variance of the measurement errors and analyze its asymptotic properties. This estimator is required for the estimation of the variance function from noisy measurements.

## 1 Introduction

Functional data are random functions on a common domain, e.g., an interval. In reality they can only be observed on a discrete schedule, possibly intermittently, which leads to an incomplete data problem. Fortunately, this problem has by now largely been resolved (Rice and Wu, 2001; Yao et al., 2005; Li and Hsing, 2010; Zhang and Wang, 2016), and there is a large literature on the analysis of functional data. For a more comprehensive treatment, readers are referred to the monographs by Ramsay and Silverman (2005), Ferraty and Vieu (2006), Hsing and Eubank (2015) and Kokoszka and Reimherr (2017), and the review paper by Wang et al. (2016).

In this paper, we address a different type of incomplete data, which occurs frequently in longitudinal studies when subjects enter the study at a random time and are followed only for a short period within the domain. Specifically, we focus on functional data with the following property: each function is only observed on a subject-specific interval $[A_i, B_i]$, and

1. there exists an absolute constant $\delta > 0$, much smaller than the length of the domain, such that $B_i - A_i \le \delta$ for all $i$.

As a result, the design of support points (Yao et al., 2005) where one has information about the covariance function is incomplete, in the sense that there are no design points in the off-diagonal region $\mathcal{T}^c_\delta = \{(s,t) : |s-t| \ge \delta\}$. This is mathematically characterized by

$$\Big(\bigcup_i [A_i, B_i]^2\Big) \cap \mathcal{T}^c_\delta = \emptyset. \tag{1}$$

Consequently, local smoothing methods such as PACE (Yao et al., 2005), which are interpolation methods, fail to produce a consistent estimate of the covariance function in the off-diagonal region, as the problem requires extrapolation of the data.

An example is the spinal bone mineral density data collected from 423 subjects ranging in age from 8.8 to 26.2 years (Bachrach et al., 1999). The design plot for the covariance function, shown in Figure 1, indicates that all of the design points fall within a narrow band around the diagonal, while the domain of interest is much larger than this band. The cause of this phenomenon is that each individual trajectory is only recorded on an individual-specific subinterval that is much shorter than the span of the study. For the spinal bone mineral density data, the span (the length of the interval between the first and last measurements) for each individual is no larger than 4.3 years, while the span of the study is about 17 years. Data with this characteristic, mathematically described by condition (a) or (1), are called functional snippets in this paper, analogous to the longitudinal snippets studied in Dawson and Müller (2018). As it turns out, functional snippets are quite common in longitudinal studies (Raudenbush and Chan, 1992; Galbraith et al., 2017) and require extrapolation methods to handle. Usually, this is not an issue for parametric approaches, such as linear mixed-effects models, but it requires a thoughtful plan for non- and semi-parametric approaches.
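To make the snippet design concrete, the following sketch simulates subject-specific observation windows; all parameter values (domain, span, numbers of subjects and measurements) are illustrative rather than taken from the data above. It verifies that no within-subject pair of design points is farther apart than the snippet span, which is exactly the empty-off-diagonal situation formalized in (1).

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_snippet_design(n=200, span=0.25, m=5, domain=(0.0, 1.0)):
    """Each subject i is observed at m points inside a subject-specific
    window [A_i, A_i + span], with span much shorter than the domain."""
    design = []
    for _ in range(n):
        a = rng.uniform(domain[0], domain[1] - span)  # random entry time
        design.append(np.sort(rng.uniform(a, a + span, size=m)))
    return design

design = simulate_snippet_design()
# No within-subject pair (T_ij, T_il) satisfies |T_ij - T_il| >= span, so
# the far off-diagonal region of the covariance domain contains no data.
max_within_gap = max(t.max() - t.min() for t in design)
```

Plotting all within-subject pairs $(T_{ij}, T_{il})$ from such a design reproduces the diagonal-band pattern seen in Figure 1.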

Functional fragments (Liebl, 2013; Kraus, 2015; Kraus and Stefanucci, 2019; Kneip and Liebl, 2017; Liebl and Rameseder, 2019), like functional snippets, are also partially observed functional data and have been studied broadly in the literature. However, for data investigated in these works as functional fragments, the span of a single individual domain can be nearly as large as the span of the study, making them distinctively different from functional snippets. Such data, collectively referred to as “nonsnippet functional data” in this paper, often satisfy the following condition:

1. for any , for a strictly increasing sequence .

For instance, the assumption of Kneip and Liebl (2017) implies that design points and local information are still available in the off-diagonal region; in other words, for non-snippet functional data, every region of the domain eventually contains design points once the sample is sufficiently large, contrasting with (1) for functional snippets. Other related works by Gellar et al. (2014), Goldberg et al. (2014), Gromenko et al. (2017) and Stefanucci et al. (2018) on partially observed functional data, although they do not explicitly discuss the design, require the above non-snippet condition for their proposed methodologies and theory. All of these settings can be handled with a proper interpolation method, which is fundamentally different from the extrapolation methods needed for functional snippets.

The analysis of functional snippets is more challenging than non-snippet functional data, since information in the far off-diagonal regions of the covariance structure is completely missing for functional snippets according to (1). Delaigle and Hall (2016) addressed this challenge by assuming that the underlying functional data are Markov processes, which is only valid at the discrete level, as pointed out by Descary and Panaretos (2019). Zhang and Chen (2018) and Descary and Panaretos (2019) used matrix completion methods to handle functional snippets, but their approaches require modifications to handle longitudinally recorded snippets that are sampled at random design points, and their theory does not cover random designs. Delaigle et al. (2019) proposed to numerically extrapolate an estimate, such as PACE (Yao et al., 2005), from the diagonal region to the entire domain via basis expansion. In this paper, we propose a divide-and-conquer strategy to analyze (longitudinal) functional snippets with a focus on the mean and covariance estimation. Once the covariance function has been estimated, functional principal component analysis can be performed through the spectral decomposition of the covariance operator.

Specifically, we divide the covariance function into two components, the variance function and the correlation function. The former can be estimated via classic kernel smoothing, while the latter is modeled parametrically with a potentially diverging number of parameters. The principle behind this idea is to estimate nonparametrically the components for which sufficient information is available, while parameterizing the component with missing pieces. Since the correlation structure is usually much simpler than the covariance surface, and since it can be estimated nonparametrically within the diagonal band, a parametric correlation model can be selected from candidate models in the existing literature, and this usually fits the unknown correlation structure quite well.

Compared to the aforementioned works, our proposal enjoys at least two advantages. First, it can be applied to all types of designs, whether the snippets are observed sparsely or densely, regularly or irregularly. Second, our approach is simple, thanks to the parametric modeling of the correlation structure, and yet powerful, due to its ability to accommodate a growing number of parameters and a nonparametric variance component. We stress that our semi-parametric, divide-and-conquer strategy is fundamentally different from the penalized basis expansion approach adopted in the recent paper by Lin et al. (2019), where the covariance function is represented by an analytic basis and the basis coefficients are estimated via penalized least squares. A numerical comparison of these two methods is provided in Section 5.

This divide-and-conquer approach has been explored in Fan et al. (2007) and Fan and Wu (2008) to model the covariance structure of time-varying random noise in a varying-coefficient partially linear model. We demonstrate here that a similar strategy can overcome the challenge of the missing data issue in functional snippets and further allow the dimension of the correlation function to grow to infinity. In addition, we take into account the measurement error in the observed data, which is an important component in functional data analysis but is of less interest in a partially linear model and thus was not considered in Fan et al. (2007) and Fan and Wu (2008). The presence of measurement errors complicates the estimation of the variance function, as the two are entangled along the diagonal direction of the covariance surface. Consequently, the estimation procedure for the variance function in Fan et al. (2007) and Fan and Wu (2008) does not apply. While it is possible to estimate the error variance using the approaches in Yao et al. (2005) and Liu and Müller (2009), these methods require a pilot estimate of the covariance function in the diagonal area, which involves two-dimensional smoothing and thus is not efficient. A key contribution of this paper is a new estimator for the error variance in Section 3 that is simple and easy to compute. It improves upon the estimators in Yao et al. (2005) and Liu and Müller (2009), as demonstrated through theoretical analysis and numerical studies; see Sections 4 and 5 for details.

## 2 Mean and Covariance Function Estimation

Let $X$ be a second-order random process defined on an interval $\mathcal{T}$ with mean function $\mu(t) = \mathbb{E}X(t)$ and covariance function $C(s,t) = \mathrm{cov}\{X(s), X(t)\}$. Without loss of generality, we assume $\mathcal{T} = [0, 1]$ in the sequel.

Suppose $X_1, \ldots, X_n$ is an independent random sample of $X$, where $n$ is the sample size. In practice, functional data are rarely fully observed. Instead, they are often noisily recorded at some discrete points on the domain. To accommodate this practice, we assume that each $X_i$ is measured only at $m_i$ points $T_{i1}, \ldots, T_{im_i}$, and the observed data are $Y_{ij} = X_i(T_{ij}) + \varepsilon_{ij}$ for $j = 1, \ldots, m_i$, where $\varepsilon_{ij}$ represents homoscedastic random noise such that $\mathbb{E}\varepsilon_{ij} = 0$ and $\mathrm{var}(\varepsilon_{ij}) = \sigma_0^2$. This homoscedasticity assumption can be relaxed to accommodate heteroscedastic noise; see Section 3 for details. To further elaborate the functional snippets characterized by condition (a), we assume that the $i$th subject is only available to be studied during a window of length at most $\delta$ starting at a subject-specific reference time, and the reference times are modeled as independently and identically distributed (i.i.d.) random variables. We then assume that $T_{i1}, \ldots, T_{im_i}$ are i.i.d., conditional on the reference time. These assumptions reflect the reality of many data collection processes in which subjects enter a study at a random time and are followed for a fixed period of time. Such a sampling plan, termed an accelerated longitudinal design, has the advantage of covering a long time range of interest within a short study period, as compared to a single-cohort longitudinal design.

### 2.1 Mean Function

Even though only functional snippets are observed rather than full curves, smoothing approaches such as that of Yao et al. (2005) can be applied to estimate the mean function $\mu$, since for each $t$ in the domain there is positive probability that some design points fall into a small neighborhood of $t$. Here, we adopt a ridged version of the local linear smoothing method of Zhang and Wang (2016), as follows.

Let $K$ be a kernel function and $h_\mu$ a bandwidth, and define $K_{h_\mu}(\cdot) = K(\cdot/h_\mu)/h_\mu$. The non-ridged local linear estimate of $\mu(t)$ is given by $\tilde\mu(t) = \hat b_0$ with

$$(\hat b_0, \hat b_1) = \operatorname*{argmin}_{(b_0, b_1) \in \mathbb{R}^2} \sum_{i=1}^n w_i \sum_{j=1}^{m_i} K_{h_\mu}(T_{ij} - t)\{Y_{ij} - b_0 - b_1(T_{ij} - t)\}^2,$$

where $w_1, \ldots, w_n$ are weights such that $\sum_{i=1}^n w_i = 1$. For the optimal choice of weights, readers are referred to Zhang and Wang (2018). It can be shown that $\tilde\mu(t) = (R_0 S_2 - R_1 S_1)/(S_0 S_2 - S_1^2)$, where

$$S_r = \sum_{i=1}^n w_i \sum_{j=1}^{m_i} K_{h_\mu}(T_{ij} - t)\{(T_{ij} - t)/h_\mu\}^r, \qquad R_r = \sum_{i=1}^n w_i \sum_{j=1}^{m_i} K_{h_\mu}(T_{ij} - t)\{(T_{ij} - t)/h_\mu\}^r Y_{ij}.$$

Although $\tilde\mu$ behaves well most of the time, for a finite sample there is positive probability that $S_0 S_2 - S_1^2 = 0$, hence $\tilde\mu(t)$ may become undefined. This minor issue can be addressed by ridging, a regularization technique used by Fan (1993), with details in Seifert and Gasser (1996) and Hall and Marron (1997). The basic idea is to add a small positive constant to the denominator of $\tilde\mu(t)$ when it falls below a threshold. More specifically, the ridged version of $\tilde\mu(t)$ is given by

$$\hat\mu(t) = \frac{R_0 S_2 - R_1 S_1}{S_0 S_2 - S_1^2 + \Delta\,\mathbf{1}\{|S_0 S_2 - S_1^2| < \Delta\}}, \tag{2}$$

where $\Delta$ is a sufficiently small positive constant depending on $n$ and the $m_i$.
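The ridged estimator (2) is easy to implement; the sketch below makes illustrative choices not fixed by the text above: an Epanechnikov kernel, equal subject weights in the toy check, and an ad hoc ridge constant $\Delta$.

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1)

def local_linear_mean(t, Ts, Ys, w, h):
    """Ridged local linear estimator (2) at a point t.
    Ts, Ys: per-subject arrays of design points and responses;
    w: subject weights summing to one; h: bandwidth h_mu."""
    S = np.zeros(3)   # S_0, S_1, S_2
    R = np.zeros(2)   # R_0, R_1
    for wi, Ti, Yi in zip(w, Ts, Ys):
        u = (Ti - t) / h
        k = wi * epanechnikov(u) / h          # w_i * K_h(T_ij - t)
        for r in range(3):
            S[r] += np.sum(k * u ** r)
        for r in range(2):
            R[r] += np.sum(k * u ** r * Yi)
    det = S[0] * S[2] - S[1] ** 2
    ridge = 1e-4 / sum(len(Ti) for Ti in Ts)  # ad hoc ridge constant Delta
    return (R[0] * S[2] - R[1] * S[1]) / (det + ridge * (abs(det) < ridge))

# Toy check: with constant responses, the estimator reproduces the constant.
Ts = [np.linspace(0.0, 1.0, 10)] * 5
Ys = [np.full(10, 2.0)] * 5
est = local_linear_mean(0.5, Ts, Ys, [0.2] * 5, 0.3)
```

The ridge term only activates when the local design is degenerate, so away from that event the estimator coincides with the non-ridged $\tilde\mu(t)$.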

The bandwidth $h_\mu$ can be selected via the following $\kappa$-fold cross-validation procedure. Let $\kappa$ be a positive integer and $P_1, \ldots, P_\kappa$ a roughly even random partition of the index set $\{1, \ldots, n\}$. From a set of candidate values for $h_\mu$, we choose the one for which the cross-validation error

$$\mathrm{CV}(h) = \sum_{k=1}^{\kappa} \sum_{i \in P_k} \sum_{j=1}^{m_i} \{Y_{ij} - \hat\mu_{h,-k}(T_{ij})\}^2 \tag{3}$$

is minimized, where $\hat\mu_{h,-k}$ is the estimator in (2) computed with bandwidth $h$ and with the subjects in $P_k$ excluded.
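The cross-validation logic of (3) can be sketched as follows. For brevity this uses a local-constant (Nadaraya–Watson) smoother with a Gaussian kernel in place of the ridged local linear estimator, and all data-generating choices are illustrative, so it demonstrates the selection procedure rather than the exact estimator.

```python
import numpy as np

def cv_bandwidth(Ts, Ys, hs, kappa=5, seed=1):
    """kappa-fold cross-validation (3) over candidate bandwidths hs."""
    def nw(t, T, Y, h):  # local constant smoother standing in for (2)
        k = np.exp(-0.5 * ((T - t) / h) ** 2)
        return np.sum(k * Y) / (np.sum(k) + 1e-12)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(Ts)), kappa)
    errs = []
    for h in hs:
        err = 0.0
        for fold in folds:
            mask = np.ones(len(Ts), dtype=bool)
            mask[fold] = False                 # hold out subjects in the fold
            T_tr = np.concatenate([Ts[i] for i in range(len(Ts)) if mask[i]])
            Y_tr = np.concatenate([Ys[i] for i in range(len(Ys)) if mask[i]])
            for i in fold:
                pred = np.array([nw(t, T_tr, Y_tr, h) for t in Ts[i]])
                err += np.sum((Ys[i] - pred) ** 2)
        errs.append(err)
    return hs[int(np.argmin(errs))]

# Toy data: a sinusoidal mean with small noise favors a small bandwidth.
rng = np.random.default_rng(2)
Ts = [np.sort(rng.uniform(0, 1, 8)) for _ in range(30)]
Ys = [np.sin(2 * np.pi * t) + rng.normal(0, 0.1, 8) for t in Ts]
h_best = cv_bandwidth(Ts, Ys, hs=[0.05, 0.2, 0.8])
```

Note that whole subjects, not individual observations, are held out in each fold, matching the partition of $\{1, \ldots, n\}$ in (3).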

### 2.2 Covariance Function

Estimation of the covariance function for functional snippets is considerably more challenging. As pointed out in Section 1, local information in the far off-diagonal region $\mathcal{T}^c_\delta$ is completely missing. To tackle this challenge, we first observe that the covariance function can be decomposed into two parts, a variance function and a correlation structure, i.e., $C(s,t) = \sigma_X(s)\sigma_X(t)\rho(s,t)$, where $\sigma_X^2(t) = \mathrm{var}\{X(t)\}$ is the variance function of $X$ and $\rho(s,t)$ is the correlation function. Like the mean function, the variance function can be well estimated via local linear smoothing, even in the case of functional snippets. The real difficulty stems from the estimation of the correlation structure, which we propose to model parametrically. At first glance, a parametric model might seem restrictive. However, with a nonparametric variance component and a large number of parameters, the model is often flexible enough to capture the covariance structure of the data. Indeed, in the simulation studies presented in Section 5, we demonstrate that even with a single parameter, the proposed model often yields good performance when the sample size is limited. For additional flexibility, our parametric model does not require a low-rank assumption and hence is able to model truly infinite-dimensional functional data. Trading the low-rank assumption for the proposed parametric assumption seems worthwhile, especially because we allow the dimension of the parameter to increase with the sample size, which essentially places the methodology in the nonparametric paradigm.

To estimate $\sigma_X^2$, we first note that the PACE method in Yao et al. (2005) can still be used to estimate the covariance function on the band containing the diagonal, although not on the full domain. Since $\sigma_X^2(t) = C(t,t)$, the PACE estimate on the diagonal gives rise to an estimate of $\sigma_X^2$. However, this method requires two-dimensional smoothing, which is cumbersome and computationally less efficient. In addition, it has the convergence rate of a two-dimensional smoother, which is suboptimal for a target that is a one-dimensional function. Here we propose a simpler approach that requires only one-dimensional smoothing, based on the observation that the quantity $\sigma_X^2(t) + \sigma_0^2$ can be estimated by local linear smoothing of the squared residuals $\{Y_{ij} - \hat\mu(T_{ij})\}^2$. More specifically, the non-ridged local linear estimate of $\sigma_X^2(t) + \sigma_0^2$ is $\hat b_0$ with

$$(\hat b_0, \hat b_1) = \operatorname*{argmin}_{(b_0, b_1) \in \mathbb{R}^2} \sum_{i=1}^n w_i \sum_{j=1}^{m_i} K_{h_\sigma}(T_{ij} - t)\big[\{Y_{ij} - \hat\mu(T_{ij})\}^2 - b_0 - b_1(T_{ij} - t)\big]^2,$$

where $h_\sigma$ is a bandwidth selected by a cross-validation procedure similar to (3). As with the ridged estimate of the mean function in (2), to circumvent the positive probability that $\hat b_0$ is undefined for a finite sample, we adopt its ridged version as the estimate of $\sigma_X^2(t) + \sigma_0^2$. Our estimate of $\sigma_X^2(t)$ is then obtained by subtracting $\hat\sigma_0^2$, a new estimate of the noise variance $\sigma_0^2$ to be defined in the next section, which has the convergence rate of a one-dimensional smoother. Because the ridged smoother also converges at a one-dimensional rate, the resulting estimate of $\sigma_X^2$ enjoys a one-dimensional convergence rate.

For the correlation function, we assume $\rho = \rho_\theta$ for a $d$-dimensional vector of parameters $\theta$. Here, the dimension $d$ is allowed to grow with the sample size at a certain rate; see Section 4 for details. Some popular parametric families of correlation functions are listed below.

1. Power exponential:

$$\rho_\theta(s,t) = \exp\left\{-\frac{|s-t|^{\theta_1}}{\theta_2^{\theta_1}}\right\}, \qquad 0 < \theta_1 \le 2,\ \theta_2 > 0.$$

2. Rational quadratic:

$$\rho_\theta(s,t) = \left\{1 + \frac{|s-t|^2}{\theta_2^2}\right\}^{-\theta_1}, \qquad \theta_1, \theta_2 > 0.$$

3. Matérn:

$$\rho_\theta(s,t) = \frac{1}{\Gamma(\theta_1)\, 2^{\theta_1 - 1}} \left(\frac{\sqrt{2\theta_1}\,|s-t|}{\theta_2}\right)^{\theta_1} B_{\theta_1}\!\left(\frac{\sqrt{2\theta_1}\,|s-t|}{\theta_2}\right), \qquad \theta_1, \theta_2 > 0, \tag{4}$$

with $B_{\theta_1}$ being the modified Bessel function of the second kind of order $\theta_1$.

Note that if $\rho_1, \ldots, \rho_K$ are correlation functions, then $\sum_{k=1}^K c_k \rho_k$ is also a correlation function if $c_k \ge 0$ for all $k$ and $\sum_{k=1}^K c_k = 1$. Therefore, a fairly flexible class of correlation functions can be constructed from several relatively simple classes by such convex combinations. We point out that, even when one adopts a stationary correlation function, the resulting covariance can be non-stationary due to the nonparametric, and hence often non-stationary, variance component.
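The families above can be coded directly; the sketch below uses `scipy.special.kv` for the modified Bessel function in (4), and the final lines form a convex combination as just described. The cross-check of the Matérn family against the exponential kernel at $\theta_1 = 1/2$ is a well-known identity, not a result from this paper.

```python
import numpy as np
from scipy.special import gamma, kv

def power_exponential(d, th1, th2):
    return np.exp(-np.abs(d) ** th1 / th2 ** th1)

def rational_quadratic(d, th1, th2):
    return (1.0 + (d / th2) ** 2) ** (-th1)

def matern(d, th1, th2):
    """Matern correlation (4); rho(0) = 1 is handled separately since
    u**th1 * K_th1(u) -> Gamma(th1) * 2**(th1 - 1) as u -> 0."""
    d = np.abs(np.asarray(d, float))
    out = np.ones_like(d)
    nz = d > 0
    u = np.sqrt(2 * th1) * d[nz] / th2
    out[nz] = u ** th1 * kv(th1, u) / (gamma(th1) * 2 ** (th1 - 1))
    return out

# A convex combination of correlation functions is again a correlation
# function, giving a flexible class built from simple families:
d = np.linspace(0.0, 1.0, 101)
rho = 0.3 * power_exponential(d, 1.0, 0.5) + 0.7 * matern(d, 1.5, 0.5)
```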

Given the estimate $\hat\sigma_X$, the parameter $\theta$ can be effectively estimated by least squares, i.e., $\hat\theta = \operatorname{argmin}_\theta \hat Q_n(\theta)$ with

$$\hat Q_n(\theta) = \sum_{i=1}^n \frac{1}{m_i(m_i - 1)} \sum_{1 \le j \ne l \le m_i} \{\hat\sigma_X(T_{ij})\hat\sigma_X(T_{il})\rho_\theta(T_{ij}, T_{il}) - C_{ijl}\}^2,$$

where $C_{ijl} = \{Y_{ij} - \hat\mu(T_{ij})\}\{Y_{il} - \hat\mu(T_{il})\}$ is the raw covariance of subject $i$ at two different measurement times $T_{ij}$ and $T_{il}$.
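The least-squares fit can be sketched as follows on simulated snippets. To keep the sketch self-contained, the mean (zero) and variance function (one) are taken as known so that the raw covariances $C_{ijl}$ can be formed directly; in practice the estimates $\hat\mu$ and $\hat\sigma_X$ from above are plugged in. The exponential correlation family and all simulation parameters are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

# Simulate snippets from a mean-zero Gaussian process with unit variance
# and correlation rho(s,t) = exp(-|s-t|/theta0), theta0 = 0.3.
theta0, n, m, span = 0.3, 300, 5, 0.2
Ts, Cs = [], []
for _ in range(n):
    a = rng.uniform(0, 1 - span)
    t = np.sort(rng.uniform(a, a + span, m))
    K = np.exp(-np.abs(t[:, None] - t[None, :]) / theta0)
    x = rng.multivariate_normal(np.zeros(m), K)
    Ts.append(t)
    Cs.append(np.outer(x, x))                  # raw covariances C_ijl

def Qn(theta):
    """Least-squares criterion with sigma_X = 1 and mu = 0 known."""
    q = 0.0
    for t, C in zip(Ts, Cs):
        rho = np.exp(-np.abs(t[:, None] - t[None, :]) / theta)
        off = ~np.eye(len(t), dtype=bool)      # pairs 1 <= j != l <= m_i
        q += np.mean((rho[off] - C[off]) ** 2) # 1/(m_i(m_i-1)) weighting
    return q

theta_hat = minimize_scalar(Qn, bounds=(0.05, 2.0), method="bounded").x
```

Even though within-subject lags never exceed the snippet span, the parametric form extrapolates the fitted correlation to the whole domain, which is the point of the divide-and-conquer strategy.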

## 3 Estimation of Noise Variance

The estimation of $\sigma_0^2$ has received relatively little attention in the literature. For sparse functional data, the PACE estimator proposed in Yao et al. (2005) is a popular option. However, the PACE estimator can be negative in some cases. Liu and Müller (2009) refined the PACE estimator by first fitting the observed data with it and then estimating $\sigma_0^2$ by a cross-validated residual sum of squares; see Appendix A.1 of Liu and Müller (2009) for details. These methods require an estimate of the covariance function, which is not available here before $\sigma_0^2$ itself is estimated. Moreover, the covariance estimate in both methods is obtained by two-dimensional local linear smoothing as detailed in Yao et al. (2005), which is computationally costly and leads to a slower (two-dimensional) convergence rate for these estimators. To resolve this conundrum, we propose the following new estimator, which requires estimation of neither the covariance function nor any other quantities such as the mean function.

For a bandwidth $h_0 > 0$, define the quantities

$$\begin{aligned} A_0 &= \mathbb{E}\big[\{C(T_1, T_1) + \mu(T_1)\mu(T_1) + \sigma_0^2\}\,\mathbf{1}_{|T_1 - T_2| < h_0}\big], \\ A_1 &= \mathbb{E}\big[\{C(T_1, T_1) + \mu(T_1)\mu(T_1)\}\,\mathbf{1}_{|T_1 - T_2| < h_0}\big], \end{aligned}$$

and

$$B = \mathbb{E}\,\mathbf{1}_{|T_1 - T_2| < h_0},$$

where $T_1$ and $T_2$ denote two design points from the same generic subject. From the above definitions, we immediately see that $\sigma_0^2 = (A_0 - A_1)/B$. Moreover, these quantities are easy to estimate. For example, $A_0$ and $B$ can be straightforwardly and respectively estimated by

$$\hat A_0 = \frac{1}{n}\sum_{i=1}^n \frac{1}{m_i(m_i - 1)} \sum_{j \ne l} Y_{ij}^2\,\mathbf{1}_{|T_{ij} - T_{il}| < h_0} \tag{5}$$

and

$$\hat B = \frac{1}{n}\sum_{i=1}^n \frac{1}{m_i(m_i - 1)} \sum_{j \ne l} \mathbf{1}_{|T_{ij} - T_{il}| < h_0}. \tag{6}$$

This motivates us to estimate $\sigma_0^2$ via estimation of $A_0$, $A_1$ and $B$.

It remains to estimate $A_1$, which cannot be estimated using information along the diagonal only, due to the presence of random noise. Instead, we exploit the smoothness of the covariance function and observe that if $T_2$ is close to $T_1$, say $|T_1 - T_2| < h_0$, then $C(T_1, T_1) \approx C(T_1, T_2)$, $\mu(T_1) \approx \mu(T_2)$, and

$$A_1 \approx A_2 = \mathbb{E}\big[\{C(T_1, T_2) + \mu(T_1)\mu(T_2)\}\,\mathbf{1}_{|T_1 - T_2| < h_0}\big].$$

Indeed, we quantify the approximation error $A_1 - A_2$ in Lemma 5. Therefore, it is sensible to use $A_2$ as a surrogate for $A_1$. The former can be effectively estimated by

$$\hat A_2 = \frac{1}{n}\sum_{i=1}^n \frac{1}{m_i(m_i - 1)} \sum_{j \ne l} Y_{ij} Y_{il}\,\mathbf{1}_{|T_{ij} - T_{il}| < h_0}, \tag{7}$$

and we set $\hat A_1 = \hat A_2$. Finally, the estimate of $\sigma_0^2$ is given by

$$\hat\sigma_0^2 = (\hat A_0 - \hat A_1)/\hat B. \tag{8}$$

To choose $h_0$, motivated by the convergence rate stated in Theorem 1 of the next section, we suggest an empirical rule for sparse functional snippets that combines an estimate of the average number of measurements per curve, the mean estimate from Section 2, and the overall variability of the data; the coefficient in the rule is determined by a method described in the appendix. If this rule yields a value of $h_0$ for which the neighborhood $\{(T_{ij}, T_{il}) : |T_{ij} - T_{il}| < h_0,\ j \ne l\}$ is empty or contains too few points, then we recommend choosing the minimal value of $h_0$ such that this neighborhood contains a prescribed minimal number of points. In this way, we ensure that a substantial fraction of the observed data is used for the estimation of the variance $\sigma_0^2$. This rule is found to be very effective in practice; see Section 5 for its numerical performance.
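The estimator (8) is a few lines of code. In the synthetic check below, the signal, noise level, and bandwidth are illustrative choices; the true noise variance $\sigma_0^2 = 0.25$ is recovered up to sampling error without ever estimating the covariance or mean function.

```python
import numpy as np

rng = np.random.default_rng(1)

def noise_variance(Ts, Ys, h0):
    """Estimate sigma_0^2 via (A0_hat - A1_hat)/B_hat as in (8), pooling
    within-subject pairs with |T_ij - T_il| < h0."""
    A0 = A2 = B = 0.0
    n = len(Ts)
    for t, y in zip(Ts, Ys):
        m = len(t)
        close = (np.abs(t[:, None] - t[None, :]) < h0) & ~np.eye(m, dtype=bool)
        w = 1.0 / (n * m * (m - 1))
        A0 += w * np.sum((y[:, None] ** 2) * close)      # (5)
        A2 += w * np.sum((y[:, None] * y[None, :]) * close)  # (7)
        B += w * np.sum(close)                           # (6)
    return (A0 - A2) / B

# Synthetic snippets: smooth signal plus noise with sd 0.5 (variance 0.25).
Ts, Ys = [], []
for _ in range(500):
    a = rng.uniform(0, 0.8)
    t = np.sort(rng.uniform(a, a + 0.2, 6))
    Ts.append(t)
    Ys.append(np.sin(2 * np.pi * t) + rng.normal(0, 0.5, 6))
sig2 = noise_variance(Ts, Ys, h0=0.05)
```

The small upward bias from the smooth-signal approximation $A_1 \approx A_2$ shrinks with $h_0$, which is the trade-off behind the bandwidth rule described above.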

Compared to Yao et al. (2005) and Liu and Müller (2009), the proposed estimate (8) is simple and easy to compute. Indeed, it can be computed much faster since it does not require the costly computation of a pilot estimate of the covariance function. More importantly, the ingredients $\hat A_0$, $\hat A_1$ and $\hat B$ of our estimator are obtained by one-dimensional smoothing, with the indicator terms in (5)–(7) acting as a local constant smoother. Consequently, as we show in Section 4, our estimator enjoys an asymptotic convergence rate that is faster than that of a two-dimensional local linear smoother. In addition, the proposed estimate is always nonnegative, in contrast to the one in Yao et al. (2005). This is seen from the following derivation:

$$\hat A_1 = \frac{1}{n}\sum_{i=1}^n \frac{1}{m_i(m_i - 1)} \sum_{j \ne l} Y_{ij} Y_{il}\,\mathbf{1}_{|T_{ij} - T_{il}| < h_0} \le \frac{1}{n}\sum_{i=1}^n \frac{1}{m_i(m_i - 1)} \sum_{j \ne l} \frac{Y_{ij}^2 + Y_{il}^2}{2}\,\mathbf{1}_{|T_{ij} - T_{il}| < h_0} = \hat A_0, \tag{9}$$

where the inequality uses $2ab \le a^2 + b^2$ and the last equality follows from the symmetry of the indicator in $j$ and $l$.

Remark: The above discussion assumes that the noise is homoscedastic, i.e., its variance $\sigma_0^2$ is identical for all $t$. As an extension, it is possible to modify the above procedure to account for heteroscedastic noise, as follows. With intuition and rationale similar to the homoscedastic case, we define

$$\begin{aligned} \hat A_0(t) &= \frac{1}{n}\sum_{i=1}^n \frac{1}{m_i(m_i - 1)} \sum_{j \ne l} Y_{ij}^2\,\mathbf{1}_{|T_{ij} - t| < h_0}\,\mathbf{1}_{|T_{il} - t| < h_0}, \\ \hat A_1(t) &= \frac{1}{n}\sum_{i=1}^n \frac{1}{m_i(m_i - 1)} \sum_{j \ne l} Y_{ij} Y_{il}\,\mathbf{1}_{|T_{ij} - t| < h_0}\,\mathbf{1}_{|T_{il} - t| < h_0}, \\ \hat B(t) &= \frac{1}{n}\sum_{i=1}^n \frac{1}{m_i(m_i - 1)} \sum_{j \ne l} \mathbf{1}_{|T_{ij} - t| < h_0}\,\mathbf{1}_{|T_{il} - t| < h_0}, \end{aligned}$$

and let

$$\hat\sigma_0^2(t) = \{\hat A_0(t) - \hat A_1(t)\}/\hat B(t)$$

be the estimate of $\sigma_0^2(t)$, the variance of the noise at time $t$. As in the derivation (9), one can show that this estimator is also nonnegative.

## 4 Theoretical Properties

For clarity of exposition, we assume throughout this section that all the $m_i$ have the same rate $m$, where the sampling rate $m$ may tend to infinity. We emphasize that parallel asymptotic results can be derived without this assumption by replacing $m$ with suitable averages of $m_1, \ldots, m_n$. Note that the theory presented below applies both to the case that $m$ is bounded by a constant and to the case that $m$ diverges to infinity as $n \to \infty$.

We assume that the reference times are independently and identically distributed (i.i.d.) according to a density, and that the design points of each subject are i.i.d. conditional on the subject's reference time. The i.i.d. assumptions can be relaxed to accommodate heterogeneous distributions and weak dependence, at the cost of a much more complicated analysis and heavy technicalities. As such a relaxation does not provide further insight into our problem, we do not pursue it in this paper. The following conditions are needed for our theoretical development.

1. The density of each satisfies for all , and the conditional density of given satisfies for a fixed function and for all and . Also, the derivative is Lipschitz continuous on .

2. The second derivatives of and are continuous and hence bounded on and , respectively.

3. and .

In the above, condition A characterizes the design points for functional snippets and can be relaxed, while the regularity conditions B and C are common in the literature, e.g., in Zhang and Wang (2016). According to Scheuerer (2010), B also implies that the sample paths of $X$ are continuously differentiable and hence Lipschitz continuous almost surely. Let $L_X$ denote the best (smallest) Lipschitz constant of the sample path of $X$. We will see shortly that a moment condition on $L_X$ allows us to derive a rather sharp bound for the convergence rate of $\hat\sigma_0^2$. For the bandwidth $h_0$, we require the following condition:

1. and .

The following result gives the asymptotic rate of the estimator $\hat\sigma_0^2$. The proof is straightforward once we have Lemma 5, which is given in the appendix.

###### Theorem 1.

Assume that conditions A–C and a hold.

1. . With the optimal choice , .

2. If in addition , then . With the optimal choice , .

If we define a ridged version of (8), analogous to (2), then the rates in the above theorem carry over to the ridged estimator. For comparison, under conditions stronger than A–C, the rate derived in Yao et al. (2005) for the PACE estimator is slower, and this rate was improved by Paul and Peng (2011). Our estimator clearly enjoys a faster convergence rate, in addition to its computational efficiency. The rate in part b of Theorem 1 leaves little room for improvement: when the sampling rate is bounded, the rate is optimal, and in the sparse design our rate contrasts favorably with the rate established for the PACE estimator by Paul and Peng (2011).

To study the properties of and , we shall assume

1. the kernel is a symmetric and Lipschitz continuous density function supported on .

Also, the bandwidths $h_\mu$ and $h_\sigma$ are assumed to satisfy the following conditions.

1. and .

2. and .

The choice of these bandwidths depends on the interplay between the sampling rate $m$ and the sample size $n$. The optimal choice is given in the following condition.

1. If , then , where the notation means . Otherwise, . Also, .

The asymptotic convergence rates of the mean and variance estimators are given in the following theorem, whose proof can be obtained by adapting the proof of Proposition 1 in Lin and Yao (2020) and is hence omitted. It shows that both estimators attain the same rate, which is hardly surprising since both are obtained by one-dimensional local linear smoothing. Note that our results generalize those in Fan et al. (2007) and Fan and Wu (2008) by taking the measurement errors and the order of the sampling rate into account in the theoretical analysis. In addition, our convergence rates for these estimators complement the asymptotic normality results in Fan et al. (2007) and Fan and Wu (2008).

###### Theorem 2.

Suppose that conditions A–C hold.

1. With additional conditions a and b, . With the choice of bandwidth in d, .

2. With additional conditions a and ac, . With the choice of bandwidth in d, .

To derive the asymptotic properties of $\hat\theta$, we need the convergence rate of $\hat Q_n$. Define

$$Q(\theta) = \mathbb{E}\big\{\sigma_X(T_{11})\sigma_X(T_{12})\rho_\theta(T_{11}, T_{12}) - [Y_{11} - \mu(T_{11})][Y_{12} - \mu(T_{12})]\big\}^2,$$

and assume the following conditions.

1. is twice continuously differentiable with respect to and . Furthermore, the first three derivatives of with respect to are uniformly bounded for all .

2. for some and , where denotes the true value of , and

denotes the smallest eigenvalue of a matrix.

3. for some and .

Note that in condition c we allow the smallest eigenvalue of the Hessian to decay. This departs from the fixed-dimension assumption on the parameter in Fan and Wu (2008) and enables us to incorporate the case that $\rho_\theta$ is constructed from the aforementioned convex combination of a diverging number of correlation functions, e.g., a combination whose components all satisfy condition b uniformly. Condition d, although slightly stronger than C, is often required in functional data analysis, e.g., in Li and Hsing (2010) and Zhang and Wang (2016), for the derivation of uniform convergence rates. Such uniform rates are needed to sharply bound the estimation error of $\hat Q_n$ in our development, which is critical to establish the following rate for $\hat\theta$.

###### Proposition 3.

Suppose the conditions AB and ad hold. If , then with the choice of bandwidth in d, .

The above result suggests that the estimation quality of $\hat\theta$ depends on the dimension of the parameter, the sample size, and the singularity of the Hessian matrix at the true value, as measured by the constant in condition c. In practice, a few parameters are often sufficient for an adequate fit. In such cases, the dimension need not grow with the sample size, and we obtain a parametric rate for $\hat\theta$. We are now ready to state our main theorem, which establishes the convergence rate of the covariance estimator in the Hilbert–Schmidt norm and follows immediately from the above results.

###### Theorem 4.

Under the same conditions as Proposition 3 and with the choice of bandwidth in d, we obtain the corresponding convergence rate for the covariance estimator.

In practice, a fully nonparametric approach to estimating the correlation structure, such as local regression, is inefficient, in particular when data are snippets. On the other hand, a parametric method with a fixed number of parameters might be restrictive when the sample size is large. One way to overcome this dilemma is to allow the family of parametric models to grow with the sample size. As a working assumption, one might posit that the correlation function falls into a $d_n$-dimensional family of correlation models when the sample size is $n$, where the dimension $d_n$ typically grows with the sample size. For example, one might consider the following $d_n$-dimensional Fourier basis family:

$$\kappa_\theta(s,t) = \frac{1}{\psi(s)\psi(t)} \sum_{j=1}^{d_n} \theta_j \phi_j(s)\phi_j(t), \qquad \theta_1, \ldots, \theta_{d_n} \ge 0 \ \text{ and } \ \sum_{j=1}^{d_n} \theta_j = 1, \tag{10}$$

where $\psi$ and $\phi_1, \ldots, \phi_{d_n}$ are fixed orthonormal Fourier basis functions defined on the domain. The theoretical result in Theorem 4 applies to this setting by explicitly accounting for the impact of the dimension $d_n$ on the convergence rate.
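A sketch of the family (10), with $\sqrt{2}\cos(j\pi t)$ as the orthonormal basis on $[0, 1]$. Since the precise definition of $\psi$ is not reproduced above, this sketch assumes $\psi(t) = \{\sum_j \theta_j \phi_j^2(t)\}^{1/2}$, the normalizer that makes $\kappa_\theta(t, t) = 1$.

```python
import numpy as np

def fourier_correlation(s, t, theta):
    """Correlation from the Fourier family (10) at scalar points s, t.
    Assumes psi(t) = sqrt(sum_j theta_j * phi_j(t)**2), so that the
    diagonal kappa(t, t) equals one."""
    theta = np.asarray(theta, float)
    assert np.all(theta >= 0) and np.isclose(theta.sum(), 1.0)
    j = np.arange(1, len(theta) + 1)
    phi = lambda x: np.sqrt(2.0) * np.cos(j * np.pi * x)  # orthonormal on [0, 1]
    psi = lambda x: np.sqrt(theta @ phi(x) ** 2)
    return float(theta @ (phi(s) * phi(t)) / (psi(s) * psi(t)))

theta = np.array([0.5, 0.3, 0.2])   # theta_j >= 0, summing to one
```

Since $\kappa_\theta(s, t)$ is an inner product of unit vectors $(\sqrt{\theta_j}\,\phi_j(t)/\psi(t))_j$, it is a valid correlation function by the Cauchy–Schwarz inequality.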

## 5 Simulation Studies

To evaluate the numerical performance of the proposed estimators, we generated data from a Gaussian process. Three different covariance functions were considered, namely,

• with the variance function and the Matérn correlation function ,

• with and Fourier basis functions , and