1 Introduction
In data analysis, it is natural to first understand and reduce the explanatory data before fitting a regression (see, e.g., [1, 2, 4, 17, 23]). Aside from principal component analysis (PCA) [43], a variety of different approaches (like UMAP [30], MDS [12, 33, 40, 41], SIR [28], KPCA [36], random projections [5], etc.) have become increasingly popular. There is a similarly large variety of regression and function estimation methods (like neural networks, kernel methods, decision forests, etc.). This paper aims to shed some light on the estimation errors that occur when first applying a dimensionality reduction method before fitting a regression. We focus on two frequently used and studied methods: PCA and kernel regression.
If $X$ is the explanatory data and $Y$ is an output variable in a standard regression framework, the goal is to find a regression function $f$ such that the expected squared error
$\mathbb{E}\big[(Y - f(X))^2\big]$
is minimized. The optimal regression function given perfect knowledge of the underlying distributions is given by the conditional mean $f^*(x) = \mathbb{E}[Y \mid X = x]$. When first applying a dimensionality reduction map $\pi$ on $X$, the goal is instead to find a regression function $g$ minimizing
$\mathbb{E}\big[(Y - g(\pi(X)))^2\big].$
Reducing the dimension before performing a regression has several advantages, especially in view of computational aspects and the interpretability of the resulting features and the obtained regression function (see, e.g., [9, 20, 26]). An aspect of crucial importance in any regression framework is sample efficiency, i.e., how much sample data is required to obtain good approximations of the true regression function. Usually, sample efficiency tends to deteriorate in higher dimensions, which is an instance of the curse of dimensionality. Studying sample efficiency in combination with dimensionality reduction is a focus of the analysis in this paper.
On the other hand, dimensionality reduction decreases the predictive power of the explanatory data. Indeed, in terms of information content, it is clear that for any noninvertible transformation $\pi$, the reduced data $\pi(X)$ contains at most as much information about $Y$ as $X$ does (see, e.g., [2]). However, in many cases where dimensionality reduction is applied, the implicit assumption is that the reduced data retains almost all relevant information encoded in $X$, and thus the fit based on $\pi(X)$ should be almost as good as the fit based on $X$. For PCA, the resulting structural error can be controlled via the eigenvalues of the covariance matrix of $X$; see Proposition 3.7.

In sample-based regression, only finitely many i.i.d. copies of $(X, Y)$ are given, without knowing the true underlying distribution. For kernel regression, the convergence of the sample-based regression functions as the sample size grows is a well-studied problem (see, e.g., [8, 32, 38, 44]). When including a prior dimensionality reduction step, however, new errors occur that need to be controlled.
Let $\hat\pi$ denote the estimator of the dimensionality reduction map $\pi$ obtained from the sample, and let $\hat f$ denote the estimated regression function fitted on the reduced sample points $\hat\pi(X_i)$. Comparing $\hat f$ to the optimal low-dimensional regression function, we observe two different error sources: First, obviously both $\hat\pi$ and $\hat f$ include the error introduced by sampling. Second, however, a crucial error introduced for $\hat f$ is that already the input to its estimation procedure, the reduced sample $\hat\pi(X_1), \dots, \hat\pi(X_n)$, differs in distribution from the data used to characterize the optimal low-dimensional fit. This means that the sample points used to estimate $\hat f$ are not merely i.i.d. copies of the reduced variable $\pi(X)$. Thus, the question of how close $\hat f$ is to the optimal low-dimensional regression function requires a suitable stability of kernel regression. In Theorem 2.8, we obtain a stability result which controls the difference between two kernel estimators when the difference of the underlying distributions is measured in the first Wasserstein distance. Subsequently, it is applied to the case where the perturbed data comes from the estimation error of PCA in Proposition 3.3. As the subspaces spanned by the estimated and the true PCA maps may have entirely different support, it is crucial that the stability result in Theorem 2.8 works with a metric distance like the Wasserstein distance; stability results from the literature that work with the total variation norm or similar distances (see, e.g., [10, 11, 45]) cannot be applied here. While we focus on PCA as a standard and well-studied dimensionality reduction approach, our stability result also allows us to study the impact of more advanced methods. These are especially necessary if the explanatory variables cannot be described by a linear subspace but, for instance, by a submanifold.
Intuitively, one expects that first reducing the dimension of the data and then performing a regression on the low-dimensional data should be as sample efficient as starting with the low-dimensional data directly. As just discussed, this intuition ignores the added difficulty of the estimation error for the dimensionality reduction map, and in particular that this error is propagated through the stability of the low-dimensional regression. We obtain error estimates in Theorem 3.4 and Corollary 3.5 for the overall procedure of PCA and kernel regression where the errors coming from this source are indeed restrictive, and thus the overall error estimates might be worse than when starting directly with low-dimensional data. The analysis in this paper therefore suggests that two considerations regarding sample efficiency have to be weighed against each other in dimensionality reduction: On the one hand, the actual regression on the low-dimensional space is usually faster (see Section 2.2 for a quantification thereof), but on the other hand, a new error through the estimation of the dimension reduction map is introduced. This comparison is discussed in more detail in Section 3.2. We conclude that the two-step approach is particularly useful if the dimension of the input data is large and the kernel is not too regular. These theoretical findings are confirmed in a numerical illustration.
We emphasize that the setting studied in this paper differs from the frequently studied question of simultaneously estimating a dimensionality reduction map and a regression function on the reduced data, which is for instance the case in multi-index models and related fields (see, e.g., [3, 16, 18, 19, 29, 34]). Simply applying an off-the-shelf dimensionality reduction method like PCA certainly leads to worse performance on regression tasks, as the dimensionality reduction map is estimated using just the input data, ignoring the output variable $Y$. Then again, such an approach has other benefits, like a clear interpretation of the reduced data and the ability to use the same reduced data for different regression tasks. Another considerable benefit arises in semi-supervised learning (see, e.g., [48, 49]): If a larger sample of the input data $X$ than of pairs $(X, Y)$ is given, one can use the whole input sample for the estimation of the dimensionality reduction map. If the sample of $X$ (compared to the number of sample pairs) is sufficiently large, Proposition 3.6 shows that the error arising from the estimation of the PCA map is no longer restrictive for the overall error of the PCA and kernel regression procedure, and thus the sample efficiency of the low-dimensional kernel regression is recovered.

Finally, we mention that objectives related to this paper have recently been studied from different perspectives, like dimensionality reduction for kernel regression based on partitioning the space [21, 22], or related stability questions by studying out-of-distribution performance of kernel methods [7, 14].
The remainder of the paper is structured as follows: We introduce the kernel regression setting in Section 2.1 and analyze the basic influence that dimension and dimensionality reduction methods have on sample efficiency in Section 2.2. The Wasserstein stability result for kernel regression is given in Section 2.3. Section 3 studies the combined procedure of PCA and kernel regression. Section 3.1 states basic results from the literature on PCA and Section 3.2 states the results on the estimation error. Section 4 gives numerical examples illustrating the procedure.
Throughout this paper, we denote by $\|\cdot\|$ the Euclidean norm. For matrices, we denote by $\|\cdot\|_{\mathrm{op}}$ the operator norm and by $\|\cdot\|_{\mathrm{HS}}$ the Hilbert–Schmidt norm. For a reproducing kernel Hilbert space (RKHS) $H$, we denote by $\|\cdot\|_{L(H)}$ the uniform norm on the space of bounded linear operators mapping $H$ to $H$.
2 Kernel Regression: Dimension Dependence, Convergence Rates, and Stability
This section gives results related to kernel regression. First, in Subsection 2.1 we introduce the kernel regression framework and state results from the literature on convergence rates. In Subsection 2.2, we give basic intuition and results on how kernel regression behaves under simple transformations of input measures, namely linear dimensionality reduction or inclusion of independent noise. And finally, Subsection 2.3 studies stability of the optimal kernel regression function when the input data is perturbed.
2.1 Kernel Regression Setting
We are given an input space $\mathcal{X} \subseteq \mathbb{R}^d$ and an output space $\mathcal{Y} = [-M, M]$ for some $M > 0$. The learning problem we are interested in is governed by a probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, where the goal is to find the optimal predictor of the output variables given the input variables. We denote by $P_X$ the corresponding marginal distribution of the input variable. This means the goal is to solve
$\inf_f \mathbb{E}_{(X, Y) \sim P}\big[(Y - f(X))^2\big],$
where the infimum is taken over all measurable functions mapping $\mathcal{X}$ to $\mathbb{R}$. Writing $f^*(x) := \mathbb{E}[Y \mid X = x]$, the solution is given by the conditional mean $f^*$.
However, $P$ is not known precisely, and only finitely many sample pairs $(X_1, Y_1), \dots, (X_n, Y_n)$
are observed, which are independent and identically distributed according to $P$. To approximate $f^*$ using these finitely many pairs, we introduce the regularized kernel regression problem

(1)  $\hat f_\lambda := \operatorname{argmin}_{f \in H} \; \frac{1}{n} \sum_{i=1}^n \big(f(X_i) - Y_i\big)^2 + \lambda \|f\|_H^2,$

where $\lambda > 0$ is a regularization parameter and $H$ is a reproducing kernel Hilbert space (see, e.g., [24, Section 8.6], [13]) of functions mapping $\mathcal{X}$ to $\mathbb{R}$. We use the notation $k$ for the kernel associated with $H$, and always assume that the kernel is normalized to $\sup_{x \in \mathcal{X}} k(x, x) \le 1$. We further introduce

(2)  $f_\lambda := \operatorname{argmin}_{f \in H} \; \mathbb{E}_{(X, Y) \sim P}\big[(f(X) - Y)^2\big] + \lambda \|f\|_H^2,$

which is called the population counterpart to $\hat f_\lambda$, where we note that if the empirical distribution of the given sample is used in place of $P$ in (2), the solution is exactly $\hat f_\lambda$. Defining the kernel matrix $K := (k(X_i, X_j))_{i,j=1}^n$ and the output vector $Y := (Y_1, \dots, Y_n)^\top$, we recall that the solution of (1) is given by (see, e.g., [8])
$\hat f_\lambda(x) = \sum_{i=1}^n \alpha_i\, k(x, X_i), \qquad \alpha := (K + n\lambda I_n)^{-1} Y.$
Finally, we introduce the clipped estimator
$\hat f_\lambda^{\mathrm{cl}} := \max\big(-M, \min(M, \hat f_\lambda)\big)$
and similarly $f_\lambda^{\mathrm{cl}}$. The purpose of clipping is that, since $Y$ only takes values in $[-M, M]$, it holds that clipping can only decrease the error, and in certain aspects $\hat f_\lambda^{\mathrm{cl}}$ is more tractable than $\hat f_\lambda$ (see also the discussion in [38] and references therein).
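The representer-theorem solution of (1) together with clipping can be sketched numerically as follows (a minimal illustration with a Gaussian kernel and synthetic placeholder data; all names and parameter choices are ours, not the paper's):

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gram matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 s^2)); note k(x, x) = 1."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth**2))

def krr_fit(X, Y, lam, kernel=gaussian_kernel):
    """Solve (K + n*lam*I) alpha = Y, the representer form of problem (1)."""
    n = len(X)
    K = kernel(X, X)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), Y)
    return lambda Xnew: kernel(Xnew, X) @ alpha

def clip_estimator(f, M):
    """Clipped estimator: project predictions onto the output range [-M, M]."""
    return lambda Xnew: np.clip(f(Xnew), -M, M)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))
Y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)
f_hat = clip_estimator(krr_fit(X, Y, lam=1e-3), M=1.5)
print(f_hat(X[:5]).shape)  # predictions for five input points
```

The regularization $n\lambda$ on the Gram matrix mirrors the penalty $\lambda \|f\|_H^2$ in (1); clipping is a cheap postprocessing step that never increases the error against bounded outputs.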
We will require an analogue of the kernel matrix acting on $H$. To this end, for a probability measure $\nu$ on $\mathcal{X}$, we define the kernel integral operator by
$T_\nu f := \int_{\mathcal{X}} k(\cdot, x) f(x) \, d\nu(x), \qquad f \in H.$
To study the approximation of $f^*$, two key parameters were identified in the literature to describe the convergence rate in $n$ for a suitable choice of $\lambda$:
Assumption 2.1.

Let the following hold for the marginal distribution $P_X$ and the kernel integral operator $T_{P_X}$:

(i) Let $p \in (0, 1]$ be such that there exists a constant $C_1 > 0$ so that the sequence $(\lambda_i)_{i \in \mathbb{N}}$ of nonincreasing eigenvalues of $T_{P_X}$ satisfies $\lambda_i \le C_1\, i^{-1/p}$ for all $i \in \mathbb{N}$.

(ii) Let $\beta \in (0, 1]$ be such that there exists a constant $C_2 > 0$ so that $\inf_{f \in H} \big( \lambda \|f\|_H^2 + \|f - f^*\|_{L^2(P_X)}^2 \big) \le C_2\, \lambda^\beta$ for all $\lambda > 0$.

Roughly, the parameter $p$ describes the complexity of the measure $P_X$ in terms of the kernel, while $\beta$ measures the approximation error of the function $f^*$ with functions in $H$. We briefly discuss $p$ and $\beta$ in the following example.
Example 2.2.
Since $\sup_x k(x, x) \le 1$, the operator $T_{P_X}$ is compact and its eigenvalues tend to zero [37, Theorem 4.27]. If the kernel is $m$ times continuously differentiable and $P_X$ is the uniform distribution on the Euclidean unit ball in $\mathbb{R}^d$, where $2m \ge d$, then Assumption 2.1(i) is satisfied for $p = d/(2m)$, see [38, page 3]. In particular, regular kernels lead to a smaller $p$ and a faster decay of the eigenvalues. On the other hand, the decay suffers from large dimensions $d$.

In view of (2), Assumption 2.1(ii) quantifies the approximation quality of $f^*$ in a ball of the RKHS. The function $\lambda \mapsto \inf_{f \in H} \big( \lambda \|f\|_H^2 + \|f - f^*\|_{L^2(P_X)}^2 \big)$ occurring in Assumption 2.1(ii) is also called the approximation error function. It is directly related to the interpolation properties of the RKHS, see [37, Chapter 5.6]. For instance, if $H$ is a Sobolev space of order $m$, then the kernel is essentially $m$ times differentiable and we can choose $\beta = s/m$ for Sobolev regular regression functions $f^* \in H^s$, where $s \le m$ (cf. [42, Chapter 11]). In particular, for a fixed regularity of the regression function, very regular kernels lead to a small $\beta$.

In this paper, we mainly use results by [38], while in related settings very similar convergence results are obtained, for instance, by [8, 32, 42, 44].
Remark 2.3.
The following inequality is a particular case of [38, Theorem 1] (by specializing the parameters therein): There exists a constant $C > 0$ such that for all $\tau \ge 1$ and suitable $\lambda \in (0, 1]$, it holds
$\big\| \hat f_\lambda^{\mathrm{cl}} - f^* \big\|_{L^2(P_X)}^2 \le C \tau \Big( \lambda^\beta + \frac{\lambda^{-p}}{n} \Big)$
with probability at least $1 - 3 e^{-\tau}$ with respect to the $n$-fold product measure $P^n$. The dominating terms on the right-hand side of the inequality are $\lambda^\beta$ and $\lambda^{-p}/n$ for $n \to \infty$ and $\lambda \to 0$. Optimizing for $\lambda$ yields $\lambda = n^{-1/(\beta + p)}$ and a resulting learning rate of
$\big\| \hat f_\lambda^{\mathrm{cl}} - f^* \big\|_{L^2(P_X)}^2 \le C \tau\, n^{-\beta/(\beta + p)}$
with probability at least $1 - 3 e^{-\tau}$.

In the situation of Example 2.2 with $H$ a Sobolev space of order $m$ and a Sobolev regular regression function $f^* \in H^s$ with $s \le m$, we recover the classical minimax rate $n^{-2s/(2s + d)}$. In particular, the rate deteriorates for large dimensions $d$.
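As a consistency check of the exponents (taking the Sobolev scalings $p = d/(2m)$ and $\beta = s/m$ as assumptions here), the learning rate exponent indeed reduces to the classical minimax exponent:

$$\frac{\beta}{\beta + p} \;=\; \frac{s/m}{s/m + d/(2m)} \;=\; \frac{2s}{2s + d}.$$

Note that the kernel regularity $m$ cancels, so the minimax exponent depends only on the regularity $s$ of the regression function and the dimension $d$.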
2.2 Dimensionality Reduction and Transformations of Measures
In the previous section, the two parameters $p$ and $\beta$ were introduced, which characterize the properties of the RKHS $H$ in relation to the data distribution and determine the rate of convergence, see Remark 2.3. When transforming data, it is thus important to study how these parameters change when the data distribution changes. In particular, we focus on two simple yet important cases: First, a linear $q$-dimensional subspace of $\mathbb{R}^d$ is transformed into a parametrization in $\mathbb{R}^q$, see Lemma 2.4. And second, independent noise is added to the input data, see Lemma 2.5.

For the following Lemma 2.4, the input data is transformed with a linear map $A : \mathbb{R}^d \to \mathbb{R}^q$, which we identify with the corresponding matrix $A \in \mathbb{R}^{q \times d}$. We assume that $A$ has orthonormal rows, where one may think of the PCA matrix containing the eigenvectors corresponding to the $q$ largest eigenvalues of the covariance matrix of the data. Formally, the transformation of the input data is the following: If $X$ is distributed according to $P_X$, we denote by $P_X^A$ the distribution of $AX$, which is a probability measure on $\mathbb{R}^q$, where $q \le d$. We also assume that $A$ is invertible on the support of $P_X$, meaning that there is a matrix $B \in \mathbb{R}^{d \times q}$ having orthonormal columns such that $BAX = X$ almost surely. The interpretation of this assumption is that the data distribution really does lie on a $q$-dimensional plane, see Figure 1.
Figure 1: Simple case where high-dimensional data lies on a lower-dimensional linear subspace and can be transformed with an orthonormal mapping as assumed in Lemma 2.4.

Lemma 2.4.
Assume that the kernel is of the form $k(x, y) = \phi(\|x - y\|)$ for some $\phi : [0, \infty) \to \mathbb{R}$. Then the learning problems for $P_X$ and $P_X^A$ as defined above are equivalent in the following sense: The optimal regression functions and the (clipped) regularized kernel fits can be transformed into each other via $A$ and its inverse $B$, and the parameters $p$ and $\beta$ from Assumption 2.1 can always be chosen equally for the learning problems with $P_X$ and $P_X^A$.
The proof of Lemma 2.4 is given in Section 5. We briefly mention that the result comes as no surprise, since the kernel matrices and output variables of the learning problems with $P_X$ and $P_X^A$ can be transformed into each other by invertible maps. Nevertheless, the result is important in the sense that it perfectly fits the situation of PCA. For this setting, the result formalizes the intuition that it does not matter whether one uses the data as $d$-dimensional points lying on a linear plane or the respective $q$-dimensional parametrization thereof.
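The equivalence underlying Lemma 2.4 can be checked numerically: for a radial kernel, the Gram matrix of points on a $q$-dimensional subspace coincides with the Gram matrix of their $q$-dimensional parametrization (a small sanity check with placeholder dimensions, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
d, q, n = 10, 2, 50

# Orthonormal rows A (q x d): a parametrization of a q-dimensional subspace.
A = np.linalg.qr(rng.standard_normal((d, q)))[0].T  # A @ A.T = I_q

Z = rng.standard_normal((n, q))  # low-dimensional parametrization
X = Z @ A                        # the same points embedded in R^d

def gram(U, bandwidth=1.0):
    """Gram matrix of the radial (Gaussian) kernel phi(||u - v||)."""
    sq = ((U[:, None, :] - U[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth**2))

# Pairwise distances are preserved by the orthonormal map, hence the kernel
# matrices coincide, and so does any kernel fit computed from them.
print(np.allclose(gram(X), gram(Z)))  # True
```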
The following result emphasizes the potential benefit of going from a truly high-dimensional space to a lower-dimensional one. At least in the extreme case, where all one does is filter out independent noise, this can only improve the eigenvalue decay speed of the kernel integral operator and thus improve the parameter $p$. Formally, we choose some noise distribution $\mu$ on $\mathbb{R}^d$ and denote by $P_X * \mu$ the convolution of $P_X$ with $\mu$. This means, if $X \sim P_X$ and $\varepsilon \sim \mu$ are independent, then $X + \varepsilon \sim P_X * \mu$.
Lemma 2.5.
Assume that the kernel is of the form for some . Denote the sequences of decreasing eigenvalues for , and by , and , respectively. Then
holds for all .
Lemma 2.5 implies that the decay rate of eigenvalues for the parameter $p$ in Assumption 2.1 is faster for $P_X$ compared to $P_X * \mu$, and hence, when adding noise, the influence of $p$ goes in the direction of a slower learning rate. In terms of $p$, filtering out noise can thus only improve the learning rate. The proof of Lemma 2.5 is given in Section 5.
While the above result only deals with the parameter $p$, a tractable example to see the influence of the observation error on the regularity parameter $\beta$ is the errors-in-variables model (see, e.g., [31]). In this model, the input variable is observed with centered observation errors that are independent of the remaining randomness. If the relevant distributions admit (Lebesgue) densities, one obtains a convolution representation for the perturbed regression function. Although there is a smoothing due to the convolution with the error density, in an asymptotic regime where the observation error vanishes (in distribution), this representation converges to the unperturbed regression function and inherits its regularity properties.
2.3 Stability Result
This section presents a general stability result for kernel regression with respect to the Wasserstein distance. In Section 3.2, this result will be applied to the case where the error in the data arises from the estimation error of a PCA map, see Proposition 3.3. We first state the general result in Theorem 2.8, and discuss some aspects of it below.
Definition 2.6.
For two probability measures $\mu$ and $\nu$ on $\mathbb{R}^d$, we define the (first) Wasserstein distance between $\mu$ and $\nu$ by
$\mathcal{W}_1(\mu, \nu) := \inf_{\gamma \in \Gamma(\mu, \nu)} \int \|x - y\| \, d\gamma(x, y),$
where $\Gamma(\mu, \nu)$ is the set of all probability measures on $\mathbb{R}^d \times \mathbb{R}^d$ with first marginal distribution $\mu$ and second marginal distribution $\nu$.
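For empirical measures on the real line, the Wasserstein distance admits a simple order-statistics formula, which is convenient for experiments (a standard fact; the sketch below is ours, not part of the paper):

```python
import numpy as np

def wasserstein_1d(xs, ys):
    """W_1 between two empirical measures with equally many atoms:
    in one dimension the optimal coupling sorts both samples and
    pairs them in order."""
    xs, ys = np.sort(xs), np.sort(ys)
    return np.mean(np.abs(xs - ys))

rng = np.random.default_rng(0)
a = rng.standard_normal(1000)
b = rng.standard_normal(1000) + 0.5  # shifted copy of the same law
print(wasserstein_1d(a, b))          # close to the shift 0.5
```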
We will require the following assumption for the kernel $k$.
Assumption 2.7.
Assume the form $k(x, y) = \phi(\|x - y\|)$ for the kernel $k$, where $\phi : [0, \infty) \to \mathbb{R}$ satisfies the growth condition $\phi(0) - \phi(r) \le C_\phi\, r^2$ for all $r \ge 0$ and some constant $C_\phi > 0$.

This assumption immediately implies the bound (cf. [24, Section 8.4.2])
$\| k(\cdot, x) - k(\cdot, y) \|_H = \sqrt{2\big(\phi(0) - \phi(\|x - y\|)\big)} \le \sqrt{2 C_\phi}\, \|x - y\|.$
We note that Assumption 2.7 is satisfied for frequently used choices like the Gaussian kernel or various compactly supported radial kernels (see also [47, Table 4.1] for more examples). In general, any function $\phi$ that is twice continuously differentiable, satisfies $\phi'(0) = 0$ and has a bounded second derivative satisfies Assumption 2.7, as can be seen by using a second order Taylor expansion.
Theorem 2.8.
Grant Assumption 2.7. Given and three distributions on such that is Lipschitz continuous and bounded by , and and satisfy the relation
(3) 
Then it holds
In Theorem 2.8 above, the assumption (3) on the relation between the two measures is obviously satisfied when they coincide. Further, as will be used later, it is satisfied when one measure is an empirical measure of the other with sufficiently large sample size, relative to $\lambda$. The assumption of Lipschitz continuity of the kernel is made so that it fits the standard definition of the Wasserstein distance. One may, for instance, weaken the assumption of Lipschitz continuity to Hölder continuity with exponent $\alpha \in (0, 1]$ (that is, weaken the growth condition in Assumption 2.7 to $\phi(0) - \phi(r) \le C_\phi\, r^{2\alpha}$) while simultaneously adjusting the cost function of the Wasserstein distance to $\|x - y\|^\alpha$.
Remark 2.9.
As we assume the kernel is both bounded and Lipschitz continuous, both constants occurring in Theorem 2.8 can be bounded in terms of $\|f_\lambda\|_H$, which can be controlled in various ways. Notably, it always holds $\|f_\lambda\|_H \le M / \sqrt{\lambda}$, and more generally, stronger bounds on the norm are for instance given in [44, Lemma 5.2] and indirectly by [13, Theorem 3]. In particular, if $f^* \in H$, then $\|f_\lambda\|_H$ is uniformly bounded for all $\lambda > 0$.
Example 2.10.
We exemplify Theorem 2.8 in the setting of Lemma 2.5. Say the observed input is $X + \varepsilon$, where $X \sim P_X$ and $\varepsilon \sim \mu$ are independent. Assume
Define . Then it is straightforward to see that and thus for some constant only depending on but independent of , Theorem 2.8 yields that
As mentioned in Remark 2.9, both the Lipschitz constant and the uniform bound that are implicitly part of the constant in this estimate can be controlled independently of $\lambda$ under the conditions discussed there.
We briefly discuss the order of magnitude of the relation implied by Theorem 2.8 for the difference of the regression functions, in terms of both the Wasserstein distance and $\lambda$.
First, we give an argument for why the linear order in the Wasserstein distance likely cannot be improved: One can construct a kernel and pairs of measures for which the difference of the resulting regression functions is explicitly computable and of the same order as the Wasserstein distance between the measures. This example, however, is not a strict proof of the claim above: even though the constructed kernel is positive definite, it does not satisfy Assumption 2.7. Nevertheless, the example still shows that for general kernels, nothing better than the linear behavior implied by Theorem 2.8 can be obtained, especially in view of the quantitative nature of the result.
Next, we discuss the bound for varying values of $\lambda$. Discussing the role of $\lambda$ in Theorem 2.8 is difficult due to the limiting behavior for $\lambda \to 0$: since the limiting object does not even have to be defined almost surely, the limiting expression cannot be evaluated. In accordance with that, the bound implied by Theorem 2.8 diverges for $\lambda \to 0$. However, it is difficult to find examples where this difference really diverges, keeping in mind that our output space is bounded. Nevertheless, Theorem 2.8 is a result for all values of $\lambda$, and we can at least observe the behavior for certain ranges of $\lambda$. While optimality for $\lambda \to 0$ seems difficult to verify, we can look at the behavior for large $\lambda$. In this regime, the example mentioned above is again insightful (here, the assumption on the kernel does not matter, as the behavior in $\lambda$ is the same for different kernels). Indeed, the difference of the regression functions behaves like the right-hand side of the estimate of Theorem 2.8 in this regime.
3 Dimensionality Reduction
This section studies the combined approach of dimensionality reduction and function estimation.
3.1 Principal Component Analysis
First, we state results on PCA, which are mainly taken from [35]. For a more general treatment of PCA in the context of dimensionality reduction, we refer, for instance, to [25, 27]. The results by [35] will later be used in our estimates for the combined study of PCA with regularized kernel regression. The goal is to reduce the dimension of the input data from $d$ to $q < d$. The reduced data will still be regarded as points in $\mathbb{R}^d$, although Lemma 2.4 shows that it does not matter whether one uses the points as a linear subset of $\mathbb{R}^d$ or a $q$-dimensional representation thereof.
For the PCA procedure, we are in the same setting as given in Section 2.1. Let

(4)  $\Pi_q := \big\{ \pi' : \mathbb{R}^d \to \mathbb{R}^d \;\big|\; \pi' \text{ linear, symmetric, } \pi' \circ \pi' = \pi',\ \operatorname{rank}(\pi') = q \big\}$

denote the set of rank-$q$ orthogonal projections. This means every $\pi' \in \Pi_q$ can be written as $\pi' = A^\top A$ for some $A \in \mathbb{R}^{q \times d}$ having orthonormal rows. Define

(5)  $\pi := \operatorname{argmin}_{\pi' \in \Pi_q} \; \mathbb{E}\big[ \|X - \pi'(X)\|^2 \big],$

(6)  $\hat\pi := \operatorname{argmin}_{\pi' \in \Pi_q} \; \frac{1}{n} \sum_{i=1}^n \|X_i - \pi'(X_i)\|^2 .$
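The empirical projection in (6) is obtained from the top eigenvectors of the sample covariance matrix; a minimal sketch (helper names are ours; the centering step reflects the assumption of centered data):

```python
import numpy as np

def pca_projection(X, q):
    """Empirical rank-q PCA projection pi_hat = A^T A as a (d, d) matrix,
    where the rows of A are the top-q eigenvectors of the sample covariance."""
    Xc = X - X.mean(axis=0)                 # center the sample
    cov = Xc.T @ Xc / len(X)                # sample covariance, shape (d, d)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    A = eigvecs[:, -q:].T                   # top-q eigenvectors as rows, shape (q, d)
    return A.T @ A                          # orthogonal projection onto their span

rng = np.random.default_rng(2)
# data concentrated near a 2-dimensional subspace of R^5
Z = rng.standard_normal((500, 2)) @ rng.standard_normal((2, 5))
X = Z + 0.01 * rng.standard_normal((500, 5))
P = pca_projection(X, q=2)
print(np.allclose(P @ P, P))  # True: P is a projection
```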
The following standing assumption will be used to apply the results on principal component analysis.
Assumption 3.1.
Let the following hold:

(i) $X$ is sub-Gaussian.

(ii) $\sigma_q > \sigma_{q+1}$, where $(\sigma_i)_{i = 1, \dots, d}$ is the sequence of decreasing eigenvalues of the covariance matrix of $X$.

(iii) $X$ is centered: $\mathbb{E}[X] = 0$.
The following is a corollary of [35, Proposition 2] and is the main reconstruction result for PCA that will be used in Section 3.2.
Lemma 3.2.
Grant Assumption 3.1. There exists a constant $C > 0$ such that for all $n \in \mathbb{N}$ and $\delta \in (0, 1]$, with probability $1 - \delta$, it holds
The PCA projections we work with, defined in Equation (4), are regarded as mappings from $\mathbb{R}^d$ onto itself. In terms of matrices, this means the projection mapping $A^\top A$ is used to get the reduced data lying on a $q$-dimensional plane. An alternative route would be to work with $A$ directly instead and obtain reduced data in $\mathbb{R}^q$. Similarly to Lemma 3.2, one can obtain error estimates in this setting using the Davis–Kahan theorem (see, e.g., [46]) combined with optimal rates on covariance matrix estimation (see, e.g., [6]).
3.2 Estimation Error for PCA and Kernel Regression
We now combine and extend the settings from Sections 2 and 3.1 for the procedure studied in this section. We are given sample pairs $(X_1, Y_1), \dots, (X_n, Y_n)$ which are independent and identically distributed according to $P$. The optimal PCA map $\pi$ and its empirical counterpart $\hat\pi$ are defined in Equations (5) and (6), respectively. We define the dimension-reduced data as $\pi(X_i)$ and the estimated counterpart as $\hat\pi(X_i)$, $i = 1, \dots, n$. Further, we define $P^\pi$ as the distribution of $(\pi(X), Y)$.
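The overall two-step estimator, empirical PCA followed by kernel ridge regression on the reduced data, can be sketched end to end (a schematic illustration with placeholder data and our own helper names, not the authors' implementation):

```python
import numpy as np

def fit_pca_krr(X, Y, q, lam, bandwidth=1.0):
    """Two-step estimator: estimate the PCA map from X, then kernel
    ridge regression with a Gaussian kernel on the reduced points."""
    Xc = X - X.mean(axis=0)
    _, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(X))
    A = eigvecs[:, -q:].T                 # rows: top-q eigenvectors
    Z = X @ A.T                           # reduced data in R^q
    sq = ((Z[:, None] - Z[None, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * bandwidth**2))  # Gram matrix on reduced data
    alpha = np.linalg.solve(K + len(X) * lam * np.eye(len(X)), Y)

    def predict(Xnew):
        Zn = Xnew @ A.T                   # reduce new inputs with the SAME map
        sqn = ((Zn[:, None] - Z[None, :]) ** 2).sum(-1)
        return np.exp(-sqn / (2 * bandwidth**2)) @ alpha

    return predict

rng = np.random.default_rng(3)
Z0 = rng.uniform(-1, 1, (300, 2))
B = np.linalg.qr(rng.standard_normal((8, 2)))[0].T  # embed R^2 into R^8
X = Z0 @ B
Y = np.sin(np.pi * Z0[:, 0]) * Z0[:, 1]
f = fit_pca_krr(X, Y, q=2, lam=1e-4)
print(np.mean((f(X) - Y) ** 2))  # training error, small on this smooth example
```

Note that the projection estimated on the training inputs is reused at prediction time, which is exactly the source of the distribution mismatch analyzed in this section.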
We are looking for the optimal regression function given the dimension-reduced input data. The underlying distribution of the reduced data, given by $P^\pi$, leads to the best predictor $f^\pi(z) := \mathbb{E}[Y \mid \pi(X) = z]$. Since we lack knowledge of both $\pi$ and $P$, we do not know $f^\pi$ precisely. Thus, to estimate $f^\pi$ with our finite data, we define the regularized least squares kernel fit based on the estimated reduced sample $(\hat\pi(X_i), Y_i)_{i = 1, \dots, n}$. For our error analysis, we will later also require its counterpart based on the true reduced sample $(\pi(X_i), Y_i)_{i = 1, \dots, n}$. The mappings $\hat\pi$ and the fit based on the estimated reduced sample can actually be calculated given the sample data, while $\pi$ and $f^\pi$ are the respective best possible objects that can result from the employed procedure. Of main interest in our analysis is thus the error
(7) 
First, we give the specification of Theorem 2.8 to the case relevant for bounding the error in Equation (7). To state the result, some additional notation for the occurring distributions of the reduced data is required.
Proposition 3.3.
Let $\lambda > 0$ and grant Assumption 2.7. Assume
(8) 
and set . Then
holds with probability at least $1 - \delta$ with respect to $P^n$.
For the overall error analysis, we will bound the occurring Wasserstein distance using Lemma 3.2, which leads to a high-probability bound of order $n^{-1/2}$. Further, the term involving the moments of $X$ is of course either bounded (if $X$ is bounded) or at least bounded with high probability by the weak law of large numbers, since we will always assume that $X$ has a finite second moment. Further, while the norm $\|f_\lambda\|_H$ certainly may be unbounded for small $\lambda$ in general, in various situations one may reasonably assume that it is bounded uniformly over all $\lambda$, for instance if all estimators are uniformly bounded with respect to the RKHS norm (cf. Remark 2.9). Finally, it is worth mentioning that the assumption in Equation (8) may be restrictive in general, although in the setting of [8] the assumption is shown to be satisfied asymptotically for a suitable choice of $\lambda$ leading to optimal rates (cf. the related discussion in [42, Chapter 13]).

The following result gives the overall bound for the sample-based error given in Equation (7). The constants appearing therein are as in Proposition 3.3 and Remark 2.3. The parameters $p$ and $\beta$ are as in Assumption 2.1 for the measure $P^\pi$. Finally, we introduce a further constant arising from Lemma 3.2.
Theorem 3.4.
In the above result, the error is split into three terms. These directly translate to the three terms on the right-hand side of Equation (9): the first results from the Lipschitz constant of the regression function combined with the PCA error (Lemma 3.2), the second is the PCA error combined with the stability result (Theorem 2.8), and the final term is the kernel regression approximation error on the low-dimensional space (cf. Remark 2.3).
In the following, we specify the above Theorem 3.4 by optimizing for $\lambda$ under the simplifying assumption of uniformly bounded RKHS norms, cf. Remark 2.9. The dominating terms in the bound of Theorem 3.4 for $n \to \infty$ on the right-hand side of Equation (9) are then balanced by the choice of $\lambda$.
Corollary 3.5.
Instead of the procedure introduced in this section (first PCA, then kernel regression on the $q$-dimensional subspace), consider the alternative method of fitting the regression function directly using kernel regression on $\mathbb{R}^d$. Let $\tilde p$ and $\tilde \beta$ denote the parameters from Assumption 2.1 for the original distribution $P$. By Remark 2.3, the resulting learning rate is $n^{-\tilde\beta/(\tilde\beta + \tilde p)}$. Compared with the rate in (10), we find that it is not immediate which procedure converges faster: if the exponent $\tilde\beta/(\tilde\beta + \tilde p)$ is larger than the exponent in (10), then the direct high-dimensional estimation converges faster, and vice versa.
In Corollary 3.5, the influence of the parameter $p$ from Assumption 2.1 is lost in the overall rate. More precisely, the rate behaves as if $p$ takes the (worst possible) value $1$. In a sense, this means that the rate is no longer adaptive with respect to the complexity of the reduced data distribution measured by the RKHS $H$. This makes intuitive sense, since now the complexity of the input data also influences the estimation procedure of the PCA map, which is independent of $H$.
On the other hand, in the setting of Example 2.2 but for the low-dimensional space, we can adapt the regularity of the kernel to the reduced dimension $q$, which has two advantages: First, we can allow for much smaller regularities, since $q$ might be considerably smaller than $d$, and thus the numerical stability of the kernel method can be improved (cf. [42, Section 12]). Second, the resulting $\beta$ can be chosen close to one, and the rate in Corollary 3.5 is only slightly worse than the best possible rate for Sobolev regular regression functions.
Moreover, this comparison suggests that dimensionality reduction is particularly suitable if one does not expect small values of $p$ anyway. Recall that $p$ is usually small if the kernel is very smooth and the dimension is not too large. Especially for large dimensions, one could expect high values of $p$, and thus not much is lost when applying dimensionality reduction, which basically corresponds to $p = 1$. This intuition is consistent with the numerical results in Section 4, where Figure 4 shows that utilizing PCA for dimensionality reduction leads to notable improvements for the less regular kernels considered there, while this is not (or less so) the case for the Gaussian kernel, which is infinitely smooth.
The two-step procedure described above has another interesting feature. Consider the case where more sample points of the input data are given than sample pairs, a situation frequently observed in semi-supervised learning settings. Then, the error bounds for the PCA map can use all input samples, even though the function estimation can only use the pairs. This can improve the overall estimates for the whole procedure. Let $n$ sample pairs and additionally $m$ samples of the input data be given, and consider the case where the PCA map is estimated using all $n + m$ input samples.
Proposition 3.6.
For a sufficiently large additional input sample, we thus recover the rate from Remark 2.3 in the low-dimensional space.
Finally, Proposition 3.7 gives a direct result on controlling the structural error of the procedure under a Lipschitz condition for the regression function $f^*$. It shows that the magnitude of the error is controlled by the smallest eigenvalues of the covariance matrix of $X$.
Proposition 3.7.
Let $f^*$ be Lipschitz continuous with constant $L$. Then it holds
$\mathbb{E}\big[ \big( f^*(X) - f^\pi(\pi(X)) \big)^2 \big] \le L^2 \sum_{i = q + 1}^{d} \sigma_i,$
where $\sigma_1 \ge \dots \ge \sigma_d$ are the eigenvalues of the covariance matrix of $X$ in decreasing order.
4 Examples
In this section, the overall procedure of dimensionality reduction and kernel regression is illustrated in a simple setting. We mainly aim to shed some light on the comparison of the occurring errors for direct regression versus inclusion of dimensionality reduction, as discussed above. Due to computational constraints, the insight of these examples towards true asymptotic behavior is of course limited, but we believe the illustration for smaller sample sizes and visualization of the absolute errors involved can nevertheless be insightful.
The data is generated as follows: a sample of a two-dimensional uniform distribution on a grid is rotated into the higher-dimensional input space with an orthogonal transformation. The first two eigenvalues of the covariance matrix of the input data are thus positive and the remaining ones are zero.
As in Section 3.1, we denote by $\pi$ the PCA map and by $\hat\pi$ the estimated counterpart. The excess reconstruction error for the PCA procedure, depending on the number of sample points used, which is studied in Lemma 3.2, is showcased in Figure 3. The shown error, after an application of Markov's inequality, yields the high-probability bound of Lemma 3.2, which governs the corresponding term in the stability result in Proposition 3.3. We see that the predicted linear behavior in the inverse sample size arises.
For the output data, we treat two different cases. Roughly speaking, the cases differ in whether the dependence structure of the output includes the noise inherent in the input or not. For both cases, we fix a (rather arbitrarily chosen) function $g$.
For the first case, we define