Functional Data Analysis refers to a collection of statistical models that apply to data that can be naturally represented as observations taken from (mostly unobserved) underlying functions. Instead of simply treating the data as vectors, models with a functional structure naturally incorporate into the analysis important characteristics of the random process generating the observations, such as smoothness. Functional models represent the partially observed functions as random elements on a functional space, such as , with a finite interval. Given the infinite dimension of this type of sample spaces, some form of dimension reduction or regularization is typically required. Functional Principal Components Analysis (FPCA) is a commonly used approach to obtain optimal lower-dimensional representations of the observations (Boente et al., 
). Moreover, FPCA can also be used to describe the main characteristics of the process generating the functions, similarly to what is done in the finite-dimensional case with the interpretation of PCA loadings. Specifically, the trajectories of each subject can be represented by their coefficients (scores) on a few elements of the basis of eigenfunctions (see for instance, Ramsay and Silverman and Hall and Horowitz .) Other bases can also be used to represent the trajectories, e.g. a sufficiently rich -splines basis (James et al., ).
In this paper, we study the problem of reliably estimating the principal functions when the data may contain a small proportion of atypical observations. It is well-known that, similarly to what happens in the finite dimensional case, most FPCA methods can be seriously affected in such a situation. The outliers need not be “extreme” data points, but might consist of curves that behave differently from the others, or that display a persistent behaviour either in shift, amplitude and/or shape (see, e.g., Hubert et al. ).
Robust FPCA methods in the literature can be classified in three groups, depending on the specific property of principal components on which they focus. Some of them rely on performing the eigenanalysis of a robust estimator of the covariance or scatter operator. Others estimate the principal functions by searching for directions that maximize a robust estimator of the spread or scale of the corresponding projections. A third group of robust methods estimate the principal subspaces by minimizing a robust measure of the reconstruction error (the discrepancy between the observations and their orthogonal projections on the eigenspace). A more detailed discussion can be found in Section 2 below.
Most functional data analysis methods assume that each trajectory is observed on a relatively fine grid of points, often equally spaced. The corresponding approaches to analyze such functional data can be found, for instance, in Ramsay and Silverman , Ferraty and Vieu  and Ferraty and Romain . Recent reviews of functional data analysis methods include: Horváth and Kokoszka , Hsing and Eubank , Cuevas , Goia and Vieu , and Wang et al. .
To the best of our knowledge, most of the robust FPCA methods proposed in the literature also require that the curves be observed in a relatively dense grid. However, there are many applications in which trajectories are measured only a few times for each sampling unit. Such longitudinal data sets with only a few available observations per curve (possibly recorded at irregular intervals), are relatively common in applications, and in many cases it is sensible to consider an underlying functional structure (of smooth trajectories, for example). Only a few FPCA methods exist in the literature for this type of data (see for example, James et al. , Yao et al.  and Li and Hsing ).
The approach of James et al.,  consists of assuming a finite-dimensional expansion for the underlying random process (i.e. a Karhunen-Loeve expansion with only finitely many terms), and approximating the mean function and eigenfunctions with splines. An alternative proposal was given by Yao et al.  (see also Staniswalis and Lee ). It consists of estimating the covariance function using a bivariate smoother over the cross-products of the available observations. In this way, information about the covariance function of the process at different points is “borrowed” from different curves. Principal directions are constructed from the estimated covariance function, and scores are estimated using best linear predictors. Similarly, Li and Hsing  estimate the mean and covariance functions with local linear smoothers, but use weights to ensure that the effect that each curve has on the optimizers is not overly affected by the number of points at which they were observed.
To construct robust FPCA estimators, Gervini  modified the reduced rank proposal of James et al.,  assuming that the finite-dimensional standardized scores and the measurement errors have a joint multivariate Student’s T distribution. The resulting estimators for the centre function and eigenfunctions are Fisher-consistent. A related method using -estimators for the basis coefficients and scores has been recently proposed by Maronna .
An intuitively straightforward approach to obtain a robust version of the method proposed by Yao et al.  would be to replace the bivariate smoother of the cross-products with a robust alternative. However, the regression approach of Staniswalis and Lee , and Yao et al. 
works because the smoother estimates the conditional mean of the cross-products, which is the covariance function. The main challenge with using a robust smoother in this setting is that, to the best of our knowledge, there are no robust estimators for the expected value of asymmetrically distributed random variables without imposing distributional assumptions. Hence, a robust smoother of the cross-products will typically not be able to estimate the covariance function.
In this paper, we provide an alternative approach to obtain a robust estimate of the covariance function in the case of sparsely observed functional (longitudinal) data. Furthermore, this proposal can naturally be adapted to offer another way to perform FPCA for longitudinal data in the settings considered by Yao et al. . Our proposal is based on a well-known property of the conditional distribution of elliptically distributed random vectors: the conditional location parameter of one of its components given others is a linear function of the set of conditioning values. We first show that for an elliptically distributed random process, the vector of its values at a fixed set of points has a multivariate elliptical distribution. Combining this property with the previous one, we propose to use a robust local regression method to estimate the values of the covariance function when outliers may be present in the data. Our approach can also be used with a non-robust local regression method, and our numerical experiments show that this method compares favourably to that of Yao et al. .
The rest of the paper is organized as follows. In Section 2 we review previously proposed robust FPCA methods which are applicable when either the entire trajectories were observed, or they were recorded on a dense grid of points. Section 3 describes our robust estimator of the principal directions for sparsely recorded data. We discuss the results of our numerical experiments studying the robustness and finite sample performance of the different methods in Section 4. An illustration is discussed in Section 5, while Section 6 contains a discussion and further recommendations.
2 Robust methods for FPCA
As mentioned in the Introduction, atypical observations in a functional setting need not be “extreme” points and may occur in several different ways. Often they correspond to individuals that follow a different pattern than that of the majority of the data. The detection of such outliers is generally a difficult problem, and some proposals exist in the literature. Among others, we can mention the procedures described in Febrero et al. [15, 16], Hyndman and Ullah , Hyndman and Shang , Sawant et al. , and Sun and Genton .
In this paper we are concerned with methods that do not require the preliminary detection of potential atypical observations in the data. We need to introduce some notation and definitions. Although we are motivated by situations where data that can be represented as random elements on a functional space, like , with a finite interval, most of our discussion can be generalized to the case where is a random element on an arbitrary separable Hilbert space . Denote the corresponding inner product and norm with and , respectively. In classical FPCA one assumes that , which ensures the existence of the mean and the covariance operator (denoted by and , respectively). The covariance operator can be written as , where
denotes the tensor product on, i.e., the operator given by for . When , the mean function satisfies for , and the covariance operator is associated with a symmetric, non-negative definite covariance function such that , as follows .
The covariance operator
is linear, self-adjoint, continuous, and Hilbert-Schmidt. Hilbert-Schmidt operators have a countable number of eigenvalues, all of them real when the operator is self-adjoint. Letbe the eigenfunctions of associated with its eigenvalues in decreasing order. The Karhunen-Loève expansion of the process is given by the representation , where the convergence of the series is in mean square norm, that is, . The random variables are the coordinates of on the basis , that is, . Note that , , and for .
In multivariate analysis, elliptical distributions provide an important framework for the development of robust principal component procedures. Elliptical processes play a similar role for the development of robust FPCA methods, allowing for random elements that may not have finite second moments. Given, usually known as the location parameter, and , a self-adjoint, positive semi-definite and Hilbert-Schmidt operator on (the scatter operator), we say that the process has an elliptical distribution , if for any linear bounded operator the random vector has a multivariate elliptical distribution , where denotes the adjoint operator of and stands for the characteristic generator. It is easy to see that Gaussian processes satisfy the above definition with . When second moments exist, the mean of is and the covariance operator is proportional to , i.e., and for some (see Lemma 2.2 in Bali and Boente ). As noted in Boente et al. , the scatter operator is confounded with the function meaning that for any , where . For that reason, without loss of generality, when second moments exist, we will assume that . We refer the interested reader to Bali and Boente  and Boente et al. .
Like their finite-dimensional counterparts, functional principal components satisfy the following three equivalent properties:
Property 1: The first principal directions correspond to the eigenfunctions of the covariance operator associated with its largest eigenvalues.
Property 2: The first principal direction maximizes over the unit sphere , and subsequent ones solve the same optimization problem subject to the condition that be orthogonal to all the previous principal directions.
Property 3: The first principal directions provide the best -dimensional linear approximation to the random element in terms of mean squared error. Specifically, let denote the orthogonal projection of onto an arbitrary -dimensional closed linear space , and be the linear space spanned by the eigenfunctions of associated with its largest eigenvalues . If , then we have .
As mentioned in the Introduction, robust functional principal components estimators proposed in the literature can be classified in three groups, according to the property on which they focus. The first approach is based on obtaining a robust counterpart to the sample covariance operator, and use its eigenfunctions to perform robust FPCA. The second group consists of methods that sequentially search for directions that maximize a robust estimator of the spread or scale of the projections. The last class of robust FPCA methods estimates principal subspaces by searching the closed linear subspace that minimizes a robust measure of scale of the reconstruction error (the distance between the data and their orthogonal projections onto the subspace), where denotes a location parameter.
In the rest of this section we review the robust FPCA methods that are available in each of these three groups. Our proposal for the case of sparsely observed trajectories is discussed in the next Section.
2.1 Methods based on Property 1
Probably the earliest robust functional principal components estimators in the literature are those introduced in Locantore et al. . They proposed the so-called spherical principal components, which were further studied in Gervini  and Boente et al. . The basic idea is to control potential outliers in the data by normalizing the observations (forcing all curves to lie on the unit sphere). Formally, given a centre , the spatial or sign covariance operator of centered at is defined as
which can be estimated through its sample version:
Several location parameters (centres) have been considered in the literature. When using the sign covariance operator (1) the standard choice for the centre is the functional spatial or geometric median: (see Locantore et al.  and Gervini ). It can be estimated by its sample version leading to the usual spatial operator estimator . The spherical principal direction estimators correspond to its eigenfunctions. We refer to Gervini  and Cardot et al.  for consistency results of the geometric median estimator and to Boente et al.  for results regarding the asymptotic distribution of and the spherical principal directions. Other robust location estimators for above may be considered. For example, the -trimmed mean introduced in Fraiman and Muñiz , the deepest point (e.g. Cuevas , Cuevas et al.  and López-Pintado and Romo ) and the -location estimator in Sinova et al. .
In addition to being easy to compute, another appealing property of the spherical principal directions is that, under certain conditions they can be shown to be Fisher-consistent. Specifically, assume that either has an elliptical distribution, as defined above, with location and scatter operator , or that has a finite-rank representation , where the random vector has exchangeable symmetric marginal distributions (in this case, we denote ). Then, the eigenfunctions of are the same as those of and in the same order (see Boente et al.  and Gervini ).
Other robust scatter operator estimators have been proposed in the literature. For example, the -scatter operator of Kraus and Panaretos , and the geometric median covariation studied in Cardot and Godichon-Baggioni . However, the eigenfunctions of these operators are not guaranteed to remain in the same order as those of the true covariance operator when the latter exists, or of the true scatter operator for elliptical processes.
Sawant et al.  proposed to estimate the principal directions and their size indirectly using the following approach. Ramsay and Silverman  showed that if the observed trajectories and the eigenfunctions of the covariance operator are represented on a common known basis (e.g. splines or Fourier), then FPCA can be performed using standard finite-dimensional PCA on the matrix of coefficients representing the observations on the chosen basis. The robust version of this procedure in Sawant et al.  consists of replacing the PCA step above with a robust PCA alternative. In particular, they used ROBPCA (Hubert et al. ) and BACONPCA (Billor et al. ).
2.2 Methods based on Property 2
Proposals to perform robust FPCA relying on Property 2 above are typically referred to as projection-pursuit approaches. Hyndman and Ullah  discussed a robust projection-pursuit method applied to smoothed observed trajectories. Bali et al.  generalized this approach combining it with penalization and basis reduction. In order to impose smoothness on the estimated eigenfunctions, regularization is included via a penalization operator. Consider the subset , of smooth elements of and a linear operator, usually called the “differentiator”, since when one typically takes . Associated with one can construct the following symmetric positive semi-definite bilinear form , where , and the functional as . Finally, a penalized inner product is defined as , and the associated norm is .
Let be a robust univariate scale estimator (e.g. an M-scale estimator) computed on the sample projections , , . The intuitive idea is to estimate the first principal direction as the element that maximizes , over the unit sphere . In addition, Bali et al.  used a sieves approach to approximate the elements of with an increasing sequence of known bases (e.g. splines). Formally, let be a basis of , and the linear space spanned by , with as . The robust projection-pursuit estimator for the first principal direction in Bali et al.  is given by
The other principal directions are defined sequentially adding the condition that they be orthogonal to the previous ones with respect to the inner product . Bali et al.  showed that the estimators are qualitative robust. Furthermore, the procedure is Fisher-consistent when the robust univariate scale estimator is the empirical version of a functional such that there exist a constant and a self-adjoint, positive semidefinite and Hilbert-Schmidt operator satisfying for all , where stands for the distribution of . This condition is satisfied when has an elliptical distribution.
Note that the approaches described here and in Section 2.1 require that one is able to compute or approximate well the norms and inner products . For example, when this typically means that the trajectories need to have been observed over a relatively dense grid of points. When the data are measured sparsely (and only a few points per trajectory are available) these methods become infeasible.
2.3 Methods based on Property 3
The last group of robust FPCA methods estimate the eigenspaces directly, for example by minimizing a robust measure of the distance between the observations and their orthogonal projections on finite-dimensional subspaces. Lee et al.  proposed a sequential algorithm fitting linear spaces of dimension 1 to the observed data. Their proposal assumes that all trajectories , , were observed on a common grid . The goal is to find a smooth function and a vector that minimize the size of the residuals
. Because of the potential presence of outliers, the method uses a bounded loss function. Given an initial estimator for , let , , and for each , let be a robust scale estimator of the residuals , . The estimators for and are defined as the minimizers of
subject to the constraint , where is a user-chosen regularization parameter. This problem can be solved iteratively as follows. Given let be the minimizer of the objective function above, which can be obtained using iterative reweighted penalized least-squares, and given , the entries of are the projections of each trajectory along the direction : , . These steps are then iterated until convergence. Once the first principal direction is obtained, we can compute the corresponding estimated scores . The other estimated principal directions are computed sequentially as follows: assume that , , are estimators of the first principal directions, then, the th principal direction and the related scores are computed applying the previous procedure to . Lee et al.  also discussed a data-dependent and resistant procedure to select .
In some applications each curve may be observed on a different (but dense) grid, and in that case we can, in principle, proceed as recommended in Ramsay and Silverman , namely: smooth each curve and apply the method above to the values of the smoothed trajectories on a fixed grid. A drawback of this strategy is that the errors introduced by the data smoothing step cannot be included easily in the analysis.
A different approach to estimate functional principal subspaces is as follows. Assume that each trajectory is observed on a dense grid of points (that may be specific to each curve), and let be an orthonormal basis of . The basic idea is to identify each curve with the vector of its coefficients on a finite-dimensional basis, apply a robust multivariate method to estimate principal subspaces in this finite-dimensional space, and then map the results back to . Specifically, fix , the basis size, and let be the coefficient of the -th trajectory on the -th element of the basis, . The grids where each are observed need to be sufficiently dense so that these inner products can be approximated well using finite Riemman sums. For each , let be the vector of coefficients of on the basis : . We can now apply robust multivariate methods on the sample to estimate a dimensional principal subspace . Once the principal subspace is found, the functional principal direction estimators can be reconstructed by “mapping back” the results in onto . To fix ideas, let be an orthonormal basis for and let be the -dimensional approximation to , where . The functional location parameter can be reconstructed as , while the -dimensional principal direction basis in is , for . Furthermore, the “fitted values” in are .
The proposals of Boente and Salibian-Barrera  and Cevallos-Valdiviezo  are variants of this approach that differ on how the principal subspace and its orthonormal basis are estimated. Let and the -th row of . For a given and , the corresponding “fitted values” are , where are the columns of the matrix .
Boente and Salibian-Barrera  estimated the principal subspace minimizing , where is a robust scale of the -th coordinate of the residual vectors , . Although in principle one can use any robust scale estimator, Boente and Salibian-Barrera  discussed in detail the case of -scale estimators (see Maronna et al. ), for which an iterative algorithm can be derived. Cevallos-Valdiviezo  proposed using a multivariate -estimator for PCA, minimizing , where is a robust scale of the distances , using -scale and least-trimmed scale estimators. All these proposals are Fisher-consistent for elliptically distributed random processes.
3 FPCA for longitudinal (sparse) data
In this section we describe a new robust FPCA method that is applicable to the case where few observations per trajectory may be available. Moreover, when robustness is not a concern, our proposal results in a novel method for non-robust FPCA that compares favourably with existing methods in the literature.
We will assume the following framework. Let be a stochastic process defined on a probability space with continuous trajectories, where is a finite interval, which can be assumed to be without loss of generality. To allow for atypical observations, we will include processes for which finite second moments may not exist. Specifically, we will only assume that the process has an elliptical distribution , where and is a self-adjoint, positive semi-definite and Hilbert-Schmidt operator on . We will denote as the kernel defining , that is, . Recall that is symmetric and , since is a Hilbert-Schmidt operator. Finally, will denote an orthonormal basis of consisting of eigenfunctions of the scatter operator ordered according to the associated non-zero eigenvalues .
Note that elliptically distributed random processes as defined in Section 2 accept a Karhunen-Loève representation when has finite rank or when the kernel is continuous. Consider first the case where the scatter operator has finite rank (i.e. a finite number of non-zero eigenvalues). It follows that , where and has a multivariate elliptical distribution (see the proof of Proposition 3.1). When there are infinitely many positive eigenvalues, Proposition 2.1 in Boente et al.  shows that , where is a random variable independent from the zero-mean Gaussian random element . The standard Karhunen-Loève expansion for Gaussian processes implies that with , . The continuity of the covariance function and the process implies that the convergence of the series is uniform over with probability 1. Thus, defining , we obtain that has a multivariate elliptical distribution and with covariance operator . Hence, we can write
where the scores are such that, for any , the random vector has a multivariate elliptical distribution . When second moments exist, the scores are uncorrelated random variables.
The following Proposition provides the main motivation for our approach. It shows that, similarly to what holds for Gaussian processes, the vector obtained by evaluating an elliptically distributed random process on a fixed finite set of points is an elliptical random vector on . Furthermore, it also shows that the conditional distribution of the scores in (2) given a set of observations of the process is elliptical. This result suggests a natural estimator for the ’s which can be used to reconstruct the full trajectories in the sample. The proof can be found in Section 7.
Let be a random element on , with , and assume that the kernel associated with is continuous. Let be the non-null eigenvalues of and as the eigenfunction of associated with chosen so that the set is an orthonormal set in . Then,
For any fixed and , , …, in , the random vector has an elliptical distribution in with location and scatter matrix with elements , .
For any we have that where the conditional location is given by
For any fixed and , , …, in , let . We have that where
with , and the -th element of equals .
In what follows we will use to denote independent realizations of the stochastic process . We assume that each trajectory is observed at independent random “times” , , where the ’s, , are usually assumed to be random variables, independent from all others. Then, our model is:
The procedure has three steps: (1) estimating the center function ; (2) estimating the diagonal elements of the scatter function , and (3) estimating the off-diagonal entries of .
3.2.1 Step 1
We start by estimating the center function using a robust local -estimator. For example, one can consider the local smoothers proposed in Boente and Fraiman , Härdle and Tsybakov , Härdle  and Oh et al.  or robust local linear smoothers as defined in Welsh . Here we use the latter option.
Consider a -function as defined in Maronna et al. . That is, is even, nondecreasing as a function of , , and increasing for when . For each , let
where is a preliminary robust consistent estimator of scale and the weights are given by
where is a bandwidth parameter and is a kernel function, i.e. continuous, non-negative, and integrable. Then, the estimate for is
The preliminary scale estimator in (6) can, for example, be a “local MAD”, that is: the MAD of the observations in a neighborhood of . Formally:
is a constant ensuring Fisher-consistency at a given distribution. If we consider a gaussian distribution as the central model, then, where
is the cumulative distribution function of the standard normal.
3.2.2 Step 2
To estimate the diagonal elements of the scatter function we use a robust scale estimator of the residuals. Although in principle one can use any robust scale estimator, we expect smooth functionals (like M-scales) to provide more stable estimators. More precisely, let be a bounded -function, such that , and let be a fixed constant which defines the robustness of the scale estimator (Maronna et al. ). Then satisfies
where the weights are as in (7). We choose so that , where to ensure that in the case of i.i.d. observations with Gaussian errors, the resulting scale estimator is consistent and has maximum breakdown point.
3.2.3 Step 3
To estimate the off-diagonal elements , , we take advantage of Proposition 3.1, which shows that the conditional mean of given is a linear function of the conditioning value, and furthermore, that the slope is proportional to . This suggests that we can use local regression methods to estimate . Specifically, we first estimate the slope in (3) with a local regression estimator and then use that , where can be estimated as in (8). More precisely, let be the centred observations, and let be the local -regression estimator satisfying
where is a preliminary robust scale estimator and is a bounded -function. To compute we use a local MAD of the residuals from an initial robust estimator. Specifically, let , and noting that the model does not include an intercept, let be the local median of the slopes: . Construct residuals and let be the corresponding local MAD:
With computed in (9) and from Step 2, we define
To ensure that the estimated function is symmetric and smooth, we also compute and use a two-dimensional smoother (e.g. a bivariate -spline as described in Section 4) on each of them, resulting in and . Finally, the estimated scatter function is:
Note that even though defines a symmetric kernel, it may not be semi-positive definite. If necessary, after computing its eigenvalues , and corresponding eigenfunctions , we set as .
3.2.4 Reconstructing trajectories
Once we have estimated the eigenfunctions and eigenvalues of the scatter function, Proposition 3.1(c) suggests a natural way to reconstruct the trajectories. Let the -th score of the -th observation. Recall that the conditional distribution of given is elliptical with location parameter
where , and is the matrix with th element equal to . Note that if the process has finite mean then (10) equals , which is the best predictor for based on the observed trajectory.
A natural estimator for (10) consists of replacing the unknown quantities with their estimators obtained in Steps 1 - 3 above. Specifically:
where , , is the matrix with th element , and is a small regularization constant to ensure non-singularity (see Yao et al. ). Finally, for a fixed , the reconstructed curves are
3.3 The non-robust case
Note that using in the method described above naturally yields a non-robust version of our proposal, which is different from that of Yao et al. . Moreover, our numerical experiments (see Section 4) indicate that when the data do not contain outliers the non-robust version of our proposal compares favourably to PACE (Yao et al. ).
4 Monte Carlo study
In this section we report the results of a Monte Carlo study carried out to investigate the finite-sample performance and robustness of the proposed robust FPCA estimator. Our experiments compared the robust and non-robust versions of our proposal and PACE, the approach in Yao et al. . We report results for random processes with two covariance functions and different atypical observations. In each case we generated 500 samples of size , and focused on the behaviour of the estimated eigenfunctions, eigenvalues, predicted scores, and accuracy of the scatter function estimator for both clean and contaminated samples.
4.1 Simulation settings
For clean samples, the data sets are generated from the following model
with i.i.d. and . We considered two models for the location function and the principal directions :
Model 1: , , ,