Principal component analysis (PCA) plays a fundamental role in statistics due to its ability to focus on a parsimonious data subspace that is the most relevant for all practical purposes. Thus, it becomes a tool not only for data exploration but at the same time for model selection – how many important components of variation are there in the data? For high/infinite dimensional data, e.g., as in functional data analysis, it assumes an even more prominent role because a reduction in the data dimension brings the statistical problem back to more or less a classical multivariate setting. Furthermore, the various regularization techniques used for functional regression, testing and other inference problems, typically hinge on the identification of the most prominent sources of variation in the data.
One of the main drawbacks of principal component analysis for any kind of data is that the procedure of estimation/selection of the number of components to retain is often exploratory in nature. Indeed, one has to either inspect the scree plot or select the first few components that explain, sayof the total variation (see, e.g., Jolliffe (2002)). There are but a few confirmatory procedures to this end (see, e.g., Horn (1965), Velicer (1976) and Peres-Neto et al. (2005)). However, each of these procedures rely on its own assessment of what is an appropriate definition of the dimension of the data corresponding to how many components to retain. In this paper, we view the problem from the perspective of hypothesis testing. Indeed, the high level or global problem that is being considered is that of versus , where is the covariance matrix of the -dimensional distribution that generates the data. Thus, we want to test whether the data gives us enough evidence to conclude that the variation in it can be attributed to all the available
dimensions. If this null hypothesis is rejected based on the observed data, one can also consider a more intricate analysis and test thelocal hypotheses versus for each .
When dealing with functional data, one focusses on a covariance kernel (rather than a covariance matrix), which a priori is an infinite dimensional object. One would then wish to test versus at a global level, where the rank of is finite iff its Mercer expansion has at most finitely many terms. Second, we can observe each functional datum at at most a finite number, say , of points on a grid, i.e., in the form of a -tuple. For brevity, let us assume that the observation grid remains the same for each sample point. Based on the vectors of size , the covariance kernel can only be estimated at the resolution of a matrix. This estimator has rank at most . So, the global test that is feasible based on the discretely observed functional data is versus . The local hypotheses are consequently versus for . Based on the discrete observations, the simplest procedure would be to look at the rank of the empirical covariance matrix. Of course, if as in quite common in practice, the empirical covariance matrix will at least asymptotically estimate the true rank correctly, which would lead to a consistent test.
At first sight, this problem seems simple: if the number of observations exceeds the true rank, then perfect inference will be feasible. However, functional data are most often additively corrupted by unobservable measurement errors that are usually modelled as independent random variables indexed by the grid points for each sample function. This additional noise adds a “ridge” to the true covariance kernel and thus makes it a truly infinite dimensional object. More specifically, the covariance matrix of the observed erroneous data is offull rank. Clearly, this gives rise to a problem of the true rank being confounded by the additive noise. One way of removing the effect of the errors is to use some smoothing procedure on the data (see e.g., Ramsay and Silverman (2005)). But this smoothing step obfuscates the problem since the relationship between the rank of and the rank of the smoothed data is unknown, and further depends on the choice of tuning parameter(s) used for smoothing. At this stage, it would seem that the problem is “almost insolubly difficult” as pointed out by Hall and Vial (2006), who further concluded that ““conventional approaches based on formal hypothesis testing will not be effective”. As a workaround, Hall and Vial (2006) considered a “low noise
” setting (assuming the noise variance vanishes as the number of observations increases) and used an unconventional rank selection procedure based on the amount ofunconfounded noise variance. A weakness of the procedure was that it required the analyst to provide acceptable values of the noise variance for the procedure to be implemented in practice, and these bounds are to be selected in an ad hoc manner.
An alternative approach altogether is to view the problem not as one of testing, but rather as one of model selection. For instance, as part of their PACE method, and assuming Gaussian data, Yao et al. (2005) offer a solution based on a pseudo-AIC criterion applied to a smoothed covariance whose diagonal has been removed. Later work by Li et al. (2013) provides estimates of the effective dimension based on a BIC criterion employing and the estimate of the error variance obtained using the PACE approach with the difference being that they used a adaptive penalty term in place of that used in the classical BIC technique. For densely observed functional data, Li et al. (2013) also studied a modification of the AIC technique in Yao et al. (2005) by assuming a Gaussian likelihood for the data. Li et al. (2013) finally considered versions of information theoretic criteria studied earlier by Bai and Ng (2002) in the context of factor models in econometrics, where the latter method is used to choose the number of factors. For all of the procedures studied by Yao et al. (2005) and Li et al. (2013), the most crucial drawback is the involvement of smoothing parameters which enters due to use of smoothing prior to dimension estimation. The asymptotic consistency of these procedures also depends on specific decay rates for the smoothing parameters, as well as on rather elaborate assumptions on the regularity of the true mean and covariance functions, and in some cases on a Gaussian assumption.
In this paper, we steer back to a formal hypothesis testing perspective for the dimension problem. Contrary to what was previously thought, we demonstrate how to construct a valid test that circumvents the smoothing step entirely by means of matrix completion. In doing so, we not only provide an estimate of the true rank of the data, but at the same time we provide confidence guarantees, something lacking in earlier approaches. The test statistics measures the best possible least square fit of the empirical covariance’s off-diagonal elements, by nonnegative matrices of a given finite rank, exploiting the fact that the corruption affects only the diagonal, and is calibrated by a bootstrap approach with suitable centering. Intriguingly, it is shown to attain near perfect power at any level of significance, for sufficiently large samples, an effect that can be seen as being genuinely functional. Our approach presents appealing advantages compared to existing alternatives, in terms of both theory and practical performance:
It provides a genuine testing procedure, and therefore is the only procedure, to the best of our knowledge, that comes with confidence guarantees.
Unlike existing model selection procedures, it does not rely on smoothing and consequently on tuning parameters that influence the result and introduce bias.
Its theoretical properties rely on bare-bones regularity assumptions by assuming only continuity of the covariance kernel and continuous sample paths, in contrast with existing approaches which require elaborate regularity conditions.
It can handle heteroskedastic measurement errors, under which methodology hinging on pre-smoothing may face considerable difficulties, as is known from nonparametric regression.
It does not require a “low noise” regimei.e. the assumption that the noise is negligible relative to the signal. To the contrary, it can handle interlaced regimes, where the noise variance can exceed the variance of some of the principal components.
It is shown to be consistent without relying on “rate” assumptions on tuning parameters, Gaussianity assumptions, or assuming the grid to become dense as the number of curves increases.
It exhibits remarkably good finite sample performance, stably across a wide range of scenarios, contrary to other methods that may perform modestly in some settings.
The paper is organized as follows. In subsection 2.1, we discuss the problem statement and setup in detail. We then develop the test procedure through some crucial identifiability results in subsection 2.2. The asymptotic distribution and the bootstrap calibration for implementing the test is detailed in subsections 2.3 and 2.4. Since the test statistic involves a non-convex optimization step and an associated bootstrap step, we discuss the computational aspects of its implementation in Section 2.5. A very detailed simulation study is presented in section 3, where we compare the performance of our procedure with those studied by Yao et al. (2005) and Li et al. (2013). Some real data analyses are presented in 4. The technical details are collected in Section 5.
2.1 Problem Statement and Background
Let be the stochastic process in question, assumed zero mean and with continuous covariance kernel
Continuity of implies that it admits the Mercer expansion,
with the series on the right hand side converging uniformly and absolutely. Consequently, is mean square continuous and admits a Karhunen-Loève expansion
where is a sequence of uncorrelated zero-mean random variables with variances , respectively. Equality is in the mean square sense, uniformly in . Given i.i.d. replications of , we observe the noise-corrupted discrete measurements
for a grid of points
and an array of measurement error variables. We assume that the
’s are continuous random variables, independent of the
’s and themselves independent across both indices, with moments up to second order given by
Note, in particular that the are allowed to be heteroskedastic in , i.e. the measurement precision may vary over the grid points. The measured vectors are now i.i.d. random vectors in with covariance matrix
is the matrix obtained by pointwise evaluation of on the pairs , and
is the covariance matrix of the -vector .
In this setup, we wish to use the observations in order to infer whether the stochastic process is, in fact, finite dimensional, and if so what its dimension might be. We use the term infer in its formal sense, i.e. we wish to be able to make statements in the form of hypothesis tests with a given level of significance. Concretely, the question posed pertains to whether the Karhunen-Loève expansion (2) of is a finite sum rather than an infinite series, and if so, with how many terms; or, equivalently, whether the covariance is of finite rank, in the sense of a finite Mercer expansion (1), and if so of what rank.
In terms of hypothesis tests, testing whether the dimension of the process is finite can be phrased in terms of the hypothesis pair
Notice that we can never actually test wether the dimension is actually infinite under the alternative, since we have finite data, which is why we have to settle with . Typically so that . This global hypothesis pair is related to the local hypotheses
for . In particular, if we can test all local hypotheses with a controlled family-wise error rate, then we will have a test for the global hypothesis, in addition to a means to infer what the rank is, if indeed finite (more details in the next section). In any case, can be replaced by in the null hypotheses, provided is sufficiently large:
If , then there exists such that
As noted in the introduction, while this question is of clear intrinsic theoretical interest, it also arises very prominently when carrying out a functional PCA as a first step for further analysis, in particular when evaluating a scree plot to choose a truncation level: the choice of a truncation dimension can be translated into testing whether the rank of is equal to .
The frustrating traedoff faced by the statistician in the context of this problem is that:
Without any smoothing, the noise covariance confounds the the problem by the addition of a ridge to the empirical covariance, leading to an inflation of the underlying dimensionality. Specifically, the rank of is always
, with probability 1.
Attempts to denoise and approximate by means of smoothing will obfuscate the the problem, since the choice of tuning parameters will interfere with the problem of rank selection.
It is this tradeoff that Hall & Vial Hall and Vial (2006) presumably had in mind when referring to this problem of rank inference as “almost insolubly difficult”. Despite the apparent difficulty, we wish to challenge their statement that “conventional approaches based on formal hypothesis testing will not be effective”, demonstrating that this can be achieved via matrix completion. The crucial obervation is that perhaps the corrupted band can be entirely disregarded, while still being able to make statements about the rank, owing to the continuity of the problem. How precisely is explored in the next section.
2.2 The Testing Procedure
The main idea we wish to put forward here is that it is feasible make inferences about the rank of without resorting to smoothing or low noise assumptions, simply by focussing on the off-diagonal elements of the matrix . The point is that we have no information whatsoever on the diagonal matrix , and cannot attempt to annhilate it by means of smoothing without biasing inference on the rank. Still, we
i.e. the matrices are equal off the diagonal, even if their relationship on the diagonal is completely unknown. So, if we can determine from its off-diagonal entries, then we have a way forward. The first of our main results shows that this is possible, with bare-bones assumptions on :
Theorem 1 (Identifiability).
The kernel is continuous on
The grid satisfies as .
Then, there exists a critical (depending on the modulus of continuity of ) such that for and any matrix the following implication holds:
Furthermore, if we additionally assume that , then for all the map
Vanishes uniquely at when ranges over matrices of rank .
Is strictly positive when ranges over matrices of rank
Here, , is the Frobenius matrix norm, and ‘’ denotes the Hadamard (element-wise) product.
The theorem is consequential: it affirms that we can check whether the rank of equals by means of the off-diagonal entries of , and indeed constructively so. In particular, the first conclusion of the theorem allows us to reason as follows, for sufficiently large :
For a candidate rank , check whether
If the minimum is positive we are certain that rank.
Of course in practice the is unobservable and we must rely on , the empirical covariance of the observed vector ,
This motivates testing the local hypothesis pair by means of the test statistic
rejecting in favour of for large values of . Note the interpretation of the test statistic: to test whether the rank is , we measure the best possible fit of the off-diagonal elements of the empirical covariance by a matrix of rank . We reject when this fit is poor, and the calibration of
is considered in the next two sections, via an asymptotic analysis based on-estimation, and hinging on conclusion (2) of the Theorem.
For the moment, though, assume that we can obtain a -value for (or some appropriately re-scaled version, e.g. ) under the hypothesis . A priori we do not know what specific value of we should test for, or the candidate value may have come from the inspection of a scree-plot (data snooping). We thus need to consider a multiple testing setup. Doing so will also allow us to test the global hypothesis pair by means of the local tests .
We do so by the following stepwise procedure, for a given significance level
Test vs by means of .
Stop if the corresponding -value, exceeds ; otherwise continue to Step 2.
Test vs by means of .
Stop if the corresponding -value, , exceeds ; otherwise continue similarly.
Obviously, we proceed no further than the last local hypothesis . We reject the global null if and only if the precedure terminates with the rejection of all local hypotheses, up to and including . If the procedure terminates earlier, the global null is not rejected, and we subsequently declare the rank of the functional data to be the value
i.e. the smallest for which we fail to reject . This stepwise procedure strongly controls the FWER at level (see Maurer et al. (1995) and Lynch et al. (2017)). Indeed, observe that at most one of the hypotheses can be true, and suppose it corresponds to . Then, if denotes the number of false discoveries among the number of rejections made (i.e., discoveries), we must have
So, FWER = , where the probabilities are calculated under the given configuration of true and false null hypotheses, equivalently, under the assumption that is true (which automatically ensures that the other hypotheses are false). Since is arbitrary, the FWER is controlled at level .
Finally, if is the true rank of the process, then for , we have
Thus, the control over the FWER translates into a control over the probability of over-estimating the true rank.
In order for the procedure to be implemented in practice, we will require the -values corresponding to the test statistic under . The next two section justify the use of this test statistic and develop proxy -values by means of a bootstrap procedure. En route, they also establish the consistency of the test as well as an intriguing property promising that it may attain near perfect power even in finite samples.
2.3 Theoretical Justification
To justify the use of the test statistic for testing , we will derive its asymptotic distribution under the null hypothesis, after appropriate re-scaling. To this aim, we introduce the following spaces and functionals:
Furthermore, we make the following assumptions:
- Assumption (C):
The covariance kernel is continuous on .
- Assumption (H):
There exists a factorization of (satisfying under ) such that is invertible.
- Assumption (G):
The grid satisfies
where and .
Under these mild assumptions, we can state our second main result:
Theorem 2 (Asymptotic Distribution of the Test Statistic).
Suppose that Assumptions (C), (H) and (G) hold. Denote the weak (centered Gaussian) limit of by the random matrix
by the random matrix. Then,
When is valid,
When is valid, diverges to infinity as .
The theorem justifies the use of as a test statistic: though will not be precisely zero even when the true rank is , the test statistic will converge to zero under , with an asymptotic variance of the order of . Furthermore, the diffuse limiting law of under in principle allows for calibration, though it does depend on unknown quantities and its form prohibits pivoting. Nevertheless, note that if can be calibrated under , then its divergence under promises not only a consistent test, but in fact a test likely attaining near perfect power for any level of significance even in finite samples of moderate size (for more discussion on this see Remark 2 and the simulations in Section 3).
2.4 Bootstrap Calibration
To address the calibration problem, we consider a bootstrap strategy in order to generate approximate -values of for testing the pair . Notice that a naïve bootstrap implementation (direct resampling of our observations) can only generate data under the hypothesis , where is the unknown true rank. For any , the naive bootstrap will fail to correctly approximate the sought -values. In effect, we need a more elaborate re-centering scheme in order to properly resample and generate bootstrap replications under , or –as is arguably more fitting terminology in this case– a “re-ranking” scheme.
The following bootstrap scheme does just that:
Find a minimizer of over nonnegative matrices satisfying .
For each , construct an estimator of the linear predictor of given under the null hypothesis . Under , this linear predictor is defined as
and can be estimated by means of
Here we use the notation .
Construct an estimate of the measurement error covariance matrix, where , where is a minimizer of over nonnegative matrices satisfying for some choice111The value of should in principle be chosen to be larger than the true rank. This can be done either by a visual inspection of the traditional scree plot, choosing an beyond the usual elbow location; or by inspection of the “off-diagonal” scree plot, i.e. the mapping , again choosing a value of beyond the elbow that is expected to happen around . We further comment on this choice in later remarksof . If we suspect homoskedasticity of the error structure we can estimate the common variance by .
Draw i.i.d. observations from an
-dimensional centered Gaussian distribution with covariance matrix, where . For the homoskedastic case, use the modified the covariance matrix . Note that these “residuals” serve as a proxy for the unobservable residuals , where .
Define the -vectors for . Construct bootstrap samples of size , comprised of -vectors sampled uniformly with replacement from . Let be the empirical covariance matrix of the th bootstrap sample.
Let be the empirical
-quantile of the collection, where
To test the pair at level , use the test function
or equivalently use the -value
The next remark explains the heuristic behind this construction. It can be skipped at first reading, moving on to the next Theorem, which treats the question of validity.
Remark 1 (Heuristic Explanation).
As part of the proof of Theorem 2, we show that under , any minimiser converges to in probability as , where is the unique minimizer of over nonnegative matrices satisfying . Since , then under , this unique minimizer must be the covariance matrix of the rank SVD truncation of . In other words, , where the ’s and the . Now consider the linear predictor . This satisfies, and . Thus, if denotes an independent zero mean random vector with covariance matrix , then is distributed as . The important thing to now observe is that the mean and the covariance structure of is the same as that of , where is a random vector with zero mean and covariance matrix , and is an independent zero mean random vector with (diagonal) covariance matrix . Indeed, their common mean is zero and the common covariance matrix is . Thus, one can view as being distributed (up to second order) as a “rank vector + measurement error” for any . Equivalently, it can be viewed as a random sample under the null hypothesis . Indeed, the test statistic is determined by the second order structure alone. On the other hand, for , it follows from Theorem 1 that . Therefore, , and consequently, is being approximately (upto second order structure) distributed as the original , which is indeed a “rank vector + measurement error”. The idea of the bootstrap procedure is to materialise this heuristic, replacing unknown elements with their “hat counterparts” as naturally estimated in the given context.
To establish validity of the re-ranked bootstrap procedure for testing the pair , we will demonstrate that in large samples we have , under , while have under , where ‘’ denotes stochastic order of scalar random variables,
This second property formalises what is required for the centring (“re-ranking”) to work: that it generates small values of even when is not the true rank. Taken together, the two properties will imply that the test based on the bootstrap procedure will respect the level under the null, and provide nontrivial power under the alternative. Indeed, as to what concerns power, we recall that diverges under , so if we can establish tightness of under then we will have consistency in addition to the sought stochastic ordering (see Corollary 1).
Theorem 3 (Bootstrap Validity).
Let denote the empirical measure induced by sampling observations uniformly with replacement from
where and ) for some given minimiser of . Let denote the covariance of the law , and let denote the (random) distribution function of the re-scaled bootstrap statistic . Suppose that there exists a rank factorization, say, of such that is invertible. Then, under assumptions (C) and (G) it holds with -probability 1 that
where is the distribution function of the random variable
and the random vector is the weak limit of , where is the empirical covariance of the ’s. Furthermore, if is true, then the limit of coincides with the limiting distribution of , provided the are Gaussian.
Corollary 1 (Nearly perfect power under any level).
Let denote the distribution function of and be the bootstrap distribution, as in Theorem 3. If (C) and (G) hold, and if are Gaussian,
Under , with -probability 1.
Under , with -probability 1.
Here is the total variation distance of probability measures with distribution functions and .
Before moving on to issues computational as well as the finite sample performance of the scheme, we give make some additional remarks to help clarify the message conveyed by the theorem and its corollary (these can be skipped at first reading).
(a) The astute reader will notice that the Theorem considers a slightly simplified “semi-oracle” setting, where some otherwise unknown objects are assumed accessible for use in simulation. This is done in order to allow for a tractable analysis, and as non-technical a statement as possible, similar in spirit to simplifications employed, for example, in Delaigle and Hall (2012) (see equations (2.4) and (2.5) along with Theorem 2.1). The key complication in considering the fully empirically accessible version of the theorem, is not so much conceptual, but rather technical in nature. For instance, the fact that a minimisers and may not be unique implies that one would need to consider a substantially more sophisticated asymptotic regime, without truly gaining additional insight.
(b) Since the ’s are approximately distributed as the ’s, the above theorem gives a theoretical justification as to why the procedure based on bootstrapping the ’s will be able to approximate the bootstrap -value under the null hypothesis at least under Gaussianity. The Gaussianity assumption is needed to ensure that the fourth order moments are determined by the second order moments, and consequently, the distributions of and coincide under the null hypothesis .
(c) When Gaussianity cannot be assumed, it is still true that the asymptotic distribution of approximates the asymptotic distribution of under , albeit under a different rank process (determined by the spectrum of ) than the rank truncation of the original –process (which is determined by the first eigenvalues and eigenfunctions of ). To gauge the quality of this approximation, investigate the performance of the procedure under departures from Gaussianity in Section 3.
(d) The statement of Corollary 1 makes precise the following fact: since the re-scaled bootstrap statistic has a tight weak limit under whereas escapes to infinity under , test based on the bootstrap will have an asymptotic power of 1 no matter what the chosen significant level is (equivalently, the bootstrap -values will converge to zero under ). This suggests a practical strategy when the sample size is sufficiently large: to choose a small value , i.e. perform a conservative test. This is corroborated by means of simulation in Section 3.
We now discuss the computational aspects of the test statistic. Recall that
Even the simplification (removal of the rank constraint) of the first minimization, as obtained by the second one, remains a non-convex optimization problem. Thus, we cannot guarantee that standard techniques, e.g., gradient descent, will converge to a global minimum. Note that there are infinitely many minima due to the fact that if is a minima, so is for any orthogonal matrix.
However, recent work by Chen and Wainwright (Chen and Wainwright, 2015) shows that projected gradient descent methods with a suitable starting point have a high probability of returning a “good” local optima in factorized matrix completion problems. For our simulation study, we used the in-built solver optim in the R software with starting point , where is the spectral decomposition of , is the matrix obtained by retaining the first columns of , and is the matrix obtained by keeping the first rows and columns of . Although we do not exactly use the approach by Chen and Wainwright (2015), it is seen in the simulations that our way of solving the minimization problem converges reasonably quickly and yields stable results.
An important requirement in our test procedure is the choice of . One way of doing so would be to use the band-deleted scree plot. This is defined as the graph of the function
The graph of the above function is expected to be strictly positive for all , and it is expected to be very close to zero for all . So, the point where the graph almost merges with the x-axis should give a good initial estimate of the true rank of the functional data. However, such a choice of would then be data-driven and would be beyond the complexity of the asymptotic study done in the previous section, which needs a fixed choice of . An option is to split the sample into two parts, one part of choosing using the band-deleted scree plot, and use the other part with this choice of (which would now be fixed, conditional on the first split sample) to conduct the proposed procedure. For the asymptotics as in the previous section to go through, one would only need the size of the second sample to be of the same order as the original sample so that its size grows to infinity with the total sample size.
Although our procedure involves solving a non-convex problem including a subsequent bootstrap procedure, the computational time is quite reasonable in all the simulations that we carried out. Our algorithm, when run on a 64-bit Intel(R) Core(TM) i7-8550U CPU @ 1.80 GHz machine with 16 GB RAM, took 9.6 seconds to run one iteration of the bootstrapped test procedure when the sample size is and the grid size is .
3 Simulation studies
We will now investigate the finite sample performance of the procedure for estimating the true rank using bootstrap tests as discussed in the previous section. Let denote the Karhunen-Loève expansion of the true unobserved process. The principal component scores satisfy and for all . We observe for and , where are equispaced grid points. Also, for each .
3.1 Homoscedastic errors
We first consider some models with homoscedastic measurement errors, specifically:
Model A1: , , , , , , , and for all .
Model A2: The model parameters are almost the same as Model A1 except that we now set , and now has a mixture distribution that is with probability and with probability . Thus, the
-paths are somewhat “curvier” and the principal component scores follow skewed Gaussian mixture models. The latter is chosen to investigate the behaviour of the bootstrap procedure for non-Gaussian models (see point (b) of Remark2).
Model A3: , , , ,