A semi-group approach to Principal Component Analysis

Principal Component Analysis (PCA) is a well-known procedure for reducing the intrinsic complexity of a dataset, essentially by simplifying the covariance structure or the correlation structure. We introduce a novel algebraic, model-based point of view and provide in particular an extension of the PCA to distributions without second moments by formulating the PCA as a best low-rank approximation problem. In contrast to hitherto existing approaches, the approximation is based on a kind of spectral representation rather than on the real space. Nonetheless, the prominent role of the eigenvectors is here reduced to defining the approximating surface and its maximal dimension. In this perspective, our approach is close to the original idea of Pearson (1901) and hence to autoencoders. Since variable selection in linear regression can be seen as a special case of our extension, our approach gives some insight into why the various variable selection methods, such as forward selection and best subset selection, cannot be expected to coincide. The linear regression model itself and the PCA regression appear as limit cases.


1 Introduction

Principal Component Analysis (PCA), introduced by Pearson (1901), has been one of the most commonly used statistical methods for reducing the complexity in datasets: for samples with a high number of features, a linear subspace is chosen in favor of a simpler representation of the data. Under the assumption of existing variances, these applications have been justified by many theoretical results and have been applied successfully in a wide variety of scientific fields. Overviews can be found in (Hastie, 2009; Jolliffe, 2002; Ringnér, 2008), for instance. Some special care is needed when the distribution of the data deviates considerably from the multivariate normal distribution. In particular, probability laws without second moments or counting distributions may lack theoretical justification or a good interpretation when PCA is applied (Jolliffe, 2002).

Several generalizations have been considered (Bengio et al., 2013; Candès et al., 2011; Hofmann et al., 2008; Silverman, 1996; Vidal et al., 2016). A particularly flexible one is the autoencoder, a neural network architecture that has been considered both in mathematics and statistical learning (Baldi and Hornik, 1989; Jolliffe, 2002; Oja and Karhunen, 1985). The autoencoder is based on the formulation of a regression problem, where the data is reconstructed by a map that simplifies the structure in some sense. Often, simplification is meant in terms of mapping into a lower dimensional space and then back to the original space of the data. Here, the admissible maps are possibly nonlinear and chosen such that a certain loss function is minimized. Classic PCA appears as the special case of linear maps.

Algebra plays a dominant role in founding certain areas of probability theory, in particular stochastic processes (Sasvári, 2005; Schilling et al., 2012; Strokorb and Schlather, 2015). When special problems are addressed, modern algebra has also found applications in statistics, for instance, in the design of experiments (Bailey, 2004) or in learning theory (Watanabe, 2009). Surprisingly, when the data themselves are considered, 'linearity' usually refers to vector spaces over the field of real numbers, although many random variables exhibit natural linearity with respect to other operations (Golan, 1999; P. Prakash, 1974). An exception in extreme value theory is Gissibl et al. (2021), who refer to the tropical algebra, however in a different context than PCA. Distributions with a stability with respect to some given algebraic operation typically stem from limit laws and hence are infinitely divisible, again not necessarily with respect to the field of real numbers (Davydov et al., 2008). In our algebraic approach, the classic PCA reappears as the Gaussian case.

A particular class of distributions without guaranteed second moments, which exhibits linearity with respect to maxima instead of addition, are the max-stable distributions (L. de Haan, 2006; Resnick, 1987; Stoev and Taqqu, 2005). In contrast to the multivariate Gaussian distribution, the dependence structure of max-stable distributions cannot be fully described by bivariate characteristics like the covariance (Beirlant et al., 2004; Strokorb and Schlather, 2015). Hence, decomposing any such derived matrix cannot be sufficient, at least from a theoretical point of view (Jiang et al., 2020).

We also consider intrinsically vector-valued data, which appear in colour coding, for instance. Special cases thereof are matrix-valued data, which appear in a single measurement, for example in functional magnetic resonance imaging (Wang et al., 2016). Linear regression models can also be seen as a special case of vector-valued data. Both variable selection and classic PCA are dimension reducing methods. The ideas are frequently combined, leading to the PCA regression analysis or the sparse PCA, for instance (Hastie, 2009). Nonetheless, variable selection and PCA have so far been considered as different methods.

The central part of the paper is the definition of the generalized PCA in Section 3. It is based on generalizations of several well-known notions, such as stable distributions and the quadratic variation (Section 2). An important specification of our approach is the PCA for extreme values (Section 4). Some background information is given in the appendix.

2 Foundations

Since classic PCA minimizes the mean square of the residuals (Pearson, 1901), calculating the difference between random variables is implicitly required. In our generalization towards extreme values with Fréchet margins, we replace the abelian group (ℝ, +) by the semi-group (ℝ, ∨), where x ∨ y := max(x, y). Since the calculation of a difference is impossible in a semi-group context, we provide a workaround for the mean square of the residuals here. First, we have to declare for which random vectors we have a workaround (Subsection 2.3). Essentially, these vectors have a stable distribution (Subsection 2.2). In Subsection 2.6, we define a convenient distance between random vectors, which avoids the calculation of residuals. This semi-metric is based on a semi-scalar product (Subsection 2.5), which itself is based on a kind of valuation principle (Subsection 2.4). The latter is fundamental, since it (i) generalizes the quadratic variation, (ii) is unique in important cases and (iii) throws new light on the variance.

2.1 Semigroups and Semirings

Since semigroups are not that frequently used in a statistical context, we recall some basic notions; see Golan (1999) for a general introduction, for instance. Throughout the paper, we will use +, ∨, ∔ and ¨+ for the binary operators of the standard addition, the maximum, a general semigroup and a general semiring, respectively. The corresponding multi-operators are denoted by ∑, ⋁, .∑ and ¨∑.

Definition 2.1.

Let G be a nonempty set and ∔ an associative operation on G; then the tuple (G, ∔) is called a semigroup. A semigroup (G, ∔) is called

1. a monoid with identity element 0G, if an element 0G ∈ G exists, such that x ∔ 0G = 0G ∔ x = x for all x ∈ G.

2. commutative, if x ∔ y = y ∔ x for all x, y ∈ G.

3. topological, if the set G has a topology, G is a topological space and the map ∔ : G × G → G is continuous.

Definition 2.2.

A set R with addition ¨+ and multiplication ⋅ is called a semiring if

1. (R, ¨+) is a commutative monoid with identity element 0R,

2. (R, ⋅) is a monoid with identity element 1R,

3. multiplication is left and right distributive, i.e., α(β ¨+ γ) = αβ ¨+ αγ and (α ¨+ β)γ = αγ ¨+ βγ for all α, β, γ ∈ R, and

4. 0R ⋅ α = α ⋅ 0R = 0R for all α ∈ R.

Example 2.3.

Examples of practically relevant semirings are, for instance, (ℝ, +, ⋅), ([0, ∞), +, ⋅), ([0, ∞), ∨, ⋅), and (ℝk×k, +, ∗), where ∗ denotes the matrix multiplication. If Ω is a non-empty set, then (𝒫(Ω), ∪, ∩) is also a semiring. Last but not least, the quaternion number system is a semiring, which is used in certain areas of physics, see Menanno and Mazzotti (2012), for instance.

Essentially, the definition of a semiring means that the inverses with respect to addition and multiplication are missing and that the multiplication is not necessarily commutative. However, we will always assume that ∔ is commutative, that G and R have at least two elements, and that (G, ∔) and R are topological. Since the focus of the paper is on algebraic aspects, we assume for ease that both G and R are Polish. Further, we will drop the multiplication sign in formulae whenever possible. On the other hand, especially when several different semigroups or semirings are involved in a formula, we may clarify neutral elements and operators with indices.
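As a concrete plain-Python sketch, consider the max-plus (tropical) semiring (ℝ ∪ {−∞}, max, +); it is one standard example fitting the definition above, chosen here for illustration (the code and names are not from the paper):

```python
# Minimal sketch of a semiring without additive inverses: the max-plus
# (tropical) semiring (R ∪ {-inf}, max, +). All names are illustrative.
NEG_INF = float("-inf")  # neutral element of the semiring "addition" max
ZERO = 0.0               # neutral element of the semiring "multiplication" +

def add(a, b):           # semiring addition: maximum
    return max(a, b)

def mul(a, b):           # semiring multiplication: ordinary +
    return a + b

a, b, c = 3.0, -1.5, 7.0
# (R, add) is a commutative monoid with identity NEG_INF
assert add(a, NEG_INF) == a and add(a, b) == add(b, a)
# (R, mul) is a monoid with identity ZERO
assert mul(a, ZERO) == a
# distributivity: a + max(b, c) == max(a + b, a + c)
assert mul(a, add(b, c)) == add(mul(a, b), mul(a, c))
# the additive neutral element is absorbing: a + (-inf) == -inf
assert mul(a, NEG_INF) == NEG_INF
```

Note that no element except −∞ has an additive inverse under max, which is exactly the situation the definition of a semiring allows.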

In our set-up, the "scalar" random variable takes values in a monoid (G, ∔). Much more structure will be imposed on the set R, which indexes the distributions and which will be a semiring. Primarily, it is this index set that is extended to higher dimensions, yielding the so-called semimodule.

Definition 2.4.

Let (S, ⊕) be a commutative semigroup and R a semiring. Let ⊙ : R × S → S be a mapping that satisfies for all α, β ∈ R and x, y ∈ S the following properties:

 α⊙(β⊙x) = (αβ)⊙x
 α⊙(x⊕y) = (α⊙x)⊕(α⊙y)
 (α¨+β)⊙x = (α⊙x)⊕(β⊙x)
 1R⊙x = x,  0R⊙x = α⊙0S = 0S.

Then (S, ⊕, ⊙) is called a semimodule over R. A subset of S that obeys the above conditions is called a subsemimodule. If (S, ⊕) is a topological semigroup and ⊙ is continuous, the semimodule is called topological. We write αx and x ⊕ y instead of α ⊙ x and x ⊕S y, if the semimodule structure is canonic, e.g., if S = Gd for some d ∈ ℕ.

Definition 2.5.

Let S be a semimodule and let #B denote the cardinality of a subset B ⊂ S. The value

 rank S := min{#B : S = span(B)}

is called the rank of S.

Note that the span in the preceding definition is calculated according to the semimodule operations. Linear maps will play an essential role for the reconstruction of the points.

Definition 2.6.

Let S1 and S2 be topological semimodules over the same semiring R, with the commutative semigroups (S1, ⊕S1) and (S2, ⊕S2). A map H : S1 → S2 satisfying the conditions

 H(λx) = λH(x)  ∀ λ ∈ R, x ∈ S1
 H(x ⊕S1 y) = H(x) ⊕S2 H(y)  ∀ x, y ∈ S1

is called linear. If S1 and S2 are both canonical, then H is called ∔-linear.

Remark 2.7.

If S1 = Rd and S2 = Rp, a linear map H can always be represented by a matrix (Hij) ∈ Rp×d such that

 Hx := (.∑j=1,…,d Hij xj)i=1,…,p,  x ∈ Rd.
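A hedged sketch of this matrix representation in the max-plus setting of the previous code example: over the tropical semiring, the action (Hx)i = maxj (Hij + xj) is linear in the semimodule sense (⊕ is the componentwise maximum, λ ⊙ x adds λ to every component; names are illustrative, not the paper's):

```python
# Max-plus matrix-vector action: (Hx)_i = max_j (H_ij + x_j).
# A sketch of the semimodule linearity conditions, not the paper's code.
def mat_vec(H, x):
    return [max(Hij + xj for Hij, xj in zip(row, x)) for row in H]

def vec_add(x, y):            # x ⊕ y: componentwise maximum
    return [max(a, b) for a, b in zip(x, y)]

def scal(lam, x):             # λ ⊙ x: add λ to every component
    return [lam + a for a in x]

H = [[0.0, -2.0, 1.0],
     [3.0,  0.5, -1.0]]
x, y, lam = [1.0, 4.0, 0.0], [2.0, -1.0, 5.0], 2.5

# homogeneity: H(λ ⊙ x) = λ ⊙ H(x)
assert mat_vec(H, scal(lam, x)) == scal(lam, mat_vec(H, x))
# additivity: H(x ⊕ y) = H(x) ⊕ H(y)
assert mat_vec(H, vec_add(x, y)) == vec_add(mat_vec(H, x), mat_vec(H, y))
```

Both identities hold because max distributes over the addition of a constant, which is precisely the distributivity of the tropical semiring.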

Although the definitions above are in analogy to the definitions of a vector space and a linear mapping, the consequences of the transition from groups to semi-groups are severe. For instance, the dimension of a subspace of a finite dimensional space does not necessarily exist. Appendix A gives some implications that are particularly important when dealing with extreme values. Appendix A also implicitly delivers arguments why a constructive approach via explicit multivariate distributions is chosen to define the generalized PCA, and not an abstract formulation based on subsemimodules or on the rank of a matrix.

2.2 Stable distributions

For a general approach to PCA without existing variance we need a generalized notion of stable distributions, where we replace the standard addition by an arbitrary semigroup operation. Some additional care is needed with respect to the scaling properties of random variables. The following definitions provide the structure to develop a useful theory.

Definition 2.8.

Let (G, ∔) be a topological monoid and R a semiring with an additional binary operation ∘. Let F = {Fμ : μ ∈ R} be a set of distributions on G and Hμ : G → G, μ ∈ R, measurable maps, such that

 H1 = idG
 H0 ≡ 0G   (1)
 Hμ(Xν) ∼ Fμν,  μ, ν ∈ R, Xν ∼ Fν
 Xμ ∔ Xν ∼ Fμ∘ν,  Xμ ∼ Fμ, Xν ∼ Fν independent.   (2)

Then, the set F is called a stable set of distributions. We write briefly Xμ for a random variable with Xμ ∼ Fμ.

The following definition ensures that transformations of random vectors still have the required distribution, see Proposition 2.15 below.

Definition 2.9.

Let F be a stable set of distributions where all Hμ, μ ∈ R, are linear; then F is called linear.

Example 2.10.

In the case of symmetric α-stable distributions Sα(σ, 0, 0), α ∈ (0, 2], we have for σ, τ ∈ [0, ∞) that

 (G, ∔) = (ℝ, +)
 Hσ(x) = σx
 σ∘τ = (σα + τα)1/α.

Hence, the set of symmetric α-stable distributions is stable and linear. Here, the Gaussian case is included as α = 2, see Samorodnitsky and Taqqu (1994).
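The combination rule σ∘τ = (σα + τα)1/α can be illustrated numerically; in the Gaussian case α = 2 it recovers the familiar addition of variances of independent summands (a minimal sketch, not the paper's code):

```python
# Scale of a sum of two independent symmetric alpha-stable variables:
# sigma ∘ tau = (sigma**alpha + tau**alpha)**(1/alpha). Illustrative sketch.
def combine_scale(sigma, tau, alpha):
    return (sigma**alpha + tau**alpha) ** (1.0 / alpha)

# Gaussian case alpha = 2: standard deviations combine Pythagorean-style,
# since variances of independent summands add: 3**2 + 4**2 = 5**2.
assert abs(combine_scale(3.0, 4.0, 2.0) - 5.0) < 1e-12

# Cauchy case alpha = 1: scale parameters simply add.
assert abs(combine_scale(3.0, 4.0, 1.0) - 7.0) < 1e-12
```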

Example 2.11.

Matrix-valued data can be considered as vector-valued with special constraints, which are modelled as a subsemiring of the semiring of matrices. An example of such a subsemiring is a set of block diagonal matrices with fixed block structure. Let us consider here vector-valued data with values in ℝk, k ∈ ℕ. Let

 (G, ∔) = (ℝk, +)
 (R, ¨+, ⋅) = (ℝk×k, +, ∗),

where ∗ is the standard matrix multiplication. Then (G, ∔) is an abelian group and (R, ¨+, ∗) is a non-commutative ring. We identify R with the set of ℝ-linear maps, i.e.

 HA : ℝk → ℝk,  x ↦ Ax  for A ∈ R,

so that H1R is the identity matrix, for instance. The FA := N(0, AA⊤), A ∈ R, are not distinct, since for X1 ∼ N(0, 1k×k) we have

 AX1 ∼ BX1  ⇔  AA⊤ = BB⊤.

Finally, denote by M1/2 a (through some alphabetic ordering uniquely defined) square root of a positive semidefinite matrix M. Then, for A, B ∈ R and two independent random vectors XA ∼ FA, XB ∼ FB we have

 A∘B = (AA⊤ ¨+ BB⊤)1/2.

Thus, the set of k-variate Gaussian distributions is stable and linear.
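A hedged numerical sketch of this operation (psd_sqrt below is one possible choice of square root, via eigendecomposition; the example only requires some fixed choice):

```python
# Sketch of A ∘ B = (A A^T + B B^T)^(1/2) for the k-variate Gaussian model.
# psd_sqrt is one admissible square root; names are illustrative.
import numpy as np

def psd_sqrt(M):
    """Symmetric square root of a positive semidefinite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def circ(A, B):
    return psd_sqrt(A @ A.T + B @ B.T)

A = np.array([[1.0, 2.0], [0.0, 1.0]])
B = np.array([[0.5, 0.0], [1.0, 3.0]])
C = circ(A, B)

# C parametrizes the sum: Cov(AX + BY) = A A^T + B B^T = C C^T
# for independent X, Y ~ N(0, I_k).
assert np.allclose(C @ C.T, A @ A.T + B @ B.T)
# the trace variation of Example 2.21 below is additive under ∘:
assert np.isclose(np.trace(C @ C.T), np.trace(A @ A.T) + np.trace(B @ B.T))
```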

Example 2.12.

Let (G, ∔) and (R, ¨+, ∗) be as in the previous example for some k ∈ ℕ. We switch to the standard notation. Let P be the subsemiring of R of matrices of the form

 A = ( Aσ          Aβ
       0(k−1)×1   Aμ 1(k−1)×(k−1) ),

where Aσ, Aμ ∈ [0, ∞), Aβ ∈ ℝ1×(k−1), and 1(k−1)×(k−1) denotes the unity matrix. Let for ℓ = 1, …, k−1

 Sℓ := {A ∈ P : Aβ,j = 0, j ≠ ℓ},
 Lℓ := span(S1, …, Sℓ) = {A ∈ P : Aβ,j = 0 for j > ℓ}.

Then, for ℓ = 1, …, k−1, the sets Sℓ and Lℓ are subsemirings of P. We interpret this set-up as a framework for linear regression models. Let X1, …, Xk−1 be the predictor variables, which are typically assumed to be independent of the error ε. Since we aim to show later on that variable selection in linear modelling is a special case of our PCA, we assume that (X1, …, Xk−1) has any multivariate Gaussian distribution. Let Z = (ε, X1, …, Xk−1)⊤. Then, Aσ equals the standard deviation of the error term if ε ∼ N(0, 1). The first component of AZ equals the dependent variable y, i.e.,

 y = (AZ)1 = ∑i=1,…,k−1 Aβ,i Xi + Aσ ε.   (3)

The set Sℓ corresponds to a simple linear regression model based on Xℓ, for ℓ = 1, …, k−1. The set Lℓ, ℓ = 1, …, k−1, denotes the models where only the predictor variables X1, …, Xℓ are considered. In our example, the intercept, which is crucial in practice, is missing, for ease of theoretical reasoning. The family of distributions corresponding to the A ∈ P is a stable set if (X1, …, Xk−1) consists of independent Gaussian components, i.e., it can be shown that one of the square roots of AA⊤ ¨+ BB⊤ is in P if A and B are. Unfortunately, this is a trivial case for variable selection. For a general distribution of the predictor variables, a representation of the linear model as a stable family is unknown. Fortunately, P itself is rich enough so that the PCA can be applied, cf. Example 3.5.
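The example above anticipates that greedy (forward) selection and best subset selection need not agree. A small synthetic sketch (data and names invented for illustration) in which forward selection first grabs a "proxy" predictor and therefore cannot reach the best two-variable model:

```python
# Forward selection vs. best subset selection on synthetic data.
# x3 is a noisy proxy for x1 + x2, so it wins every one-variable comparison,
# although the pair {x1, x2} fits y exactly. Illustrative sketch only.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.9 * (x1 + x2) + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x2                     # exact fit with columns {0, 1}

def rss(cols):
    """Residual sum of squares of least squares on the given columns."""
    Xs = X[:, list(cols)]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ beta
    return float(r @ r)

best_pair = min(combinations(range(3), 2), key=rss)   # best subset, p = 2
first = min(range(3), key=lambda j: rss((j,)))        # forward, step 1
second = min((j for j in range(3) if j != first),
             key=lambda j: rss((first, j)))           # forward, step 2

# forward selection picks the proxy x3 first and misses {x1, x2}
assert first == 2 and set(best_pair) == {0, 1}
assert {first, second} != set(best_pair)
```

The greedy path is locally optimal at every step yet globally suboptimal, which is exactly why the two selection procedures cannot be expected to coincide.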

Definition 2.13.

Let F be a linear, stable set of distributions. If R allows the division by n ∈ ℕ in the sense that

 ◯i=1,…,n νn = νn ∘ … ∘ νn = ν

for some νn ∈ R and any ν ∈ R, then F is called a set of infinitely divisible distributions.

2.3 Multivariate distributions

Stable sets of multivariate distributions are already covered by Definition 2.8. Here, we consider an alternative, constructive definition for a multivariate version of a stable distribution that is tailored for our generalized PCA and avoids existence problems. Recall that Xμ is an abbreviation for a random variable with Xμ ∼ Fμ.

Definition 2.14.

Let F = {Fν : ν ∈ R} be a stable set of univariate distributions and d, n ∈ ℕ. Let Fdn be the set of distributions given by

 νW := .∑j=1,…,n (ν1j, …, νdj)⊤ Wj := .∑j=1,…,n (ν1jWj, …, νdjWj)⊤   (4)

for all

 ν = (νij)i=1,…,d; j=1,…,n ∈ Rd×n,

independent random variables Wj ∼ Fμj, j = 1, …, n, and μ = (μ1, …, μn) ∈ Rn. We write

 νW ∼ Fdν,μ

for the corresponding distribution. Let Fd be the weak closure of ⋃n∈ℕ Fdn. Then, an element of Fd is called a d-variate F-distribution.

We call Fd a multivariate model. It is called linear if F is linear.

Definition 2.14 ensures that the univariate margins of νW are in F. A linear multivariate model ensures that the p-variate margins of X are in Fp.

Proposition 2.15.

Let Fd be a linear multivariate model, X = νW ∼ Fdν,μ, ξ ∈ Rp×d, and

 ξX := (ξk⋅X)k=1,…,p  with  ξk⋅X = .∑i=1,…,d ξki Xi.

Then

 ξX ∼ Fpξν,μ.   (5)

Proof.

 ξk⋅X = .∑i=1,…,d ξki (.∑j=1,…,n νij Wj) = .∑i=1,…,d .∑j=1,…,n ξki νij Wj = .∑j=1,…,n (¨∑i=1,…,d ξki νij) Wj.
Remark 2.16.

Both Definition 2.8 and Definition 2.14 can be generalized slightly, replacing Hμ by maps Hν,μ, which are then applied to random variables Wj ∼ Fμj only. Then, Equation (4) is rewritten as

 X = .∑j=1,…,n (Hν1j,μj, …, Hνdj,μj)⊤ Wj := (.∑j=1,…,n Hνij,μj Wj)i=1,…,d.

Assume G ⊂ ℝ and that all Fμ are continuous and distinct, i.e., Fμ ≠ Fν for μ ≠ ν. Then Hν,μ := F−1ν ∘ Fμ is a possible choice. Here, F−1ν denotes the pseudoinverse of Fν. In practice, monotonically increasing maps are preferred, so that the Hν,μ are essentially unique. Hence, the generalized map suggests that our approach so far is essentially restricted to continuous distributions.

The set of Gamma distributions with fixed scale parameter and arbitrary non-negative shape parameters obeys this generalized framework, but fails to be a stable set.
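For continuous families, one natural choice of the generalized maps is the quantile coupling F−1ν ∘ Fμ. A minimal sketch with the exponential family as a stand-in (the family and all names are illustrative assumptions, not taken from the paper):

```python
# Quantile coupling H = F_nu^{-1} ∘ F_mu for the exponential family with
# scale parameter mu. For exponentials this collapses to the linear map
# x -> (nu/mu) x, which makes the construction easy to verify. Sketch only.
import math

def F(mu, x):                 # exponential cdf with scale mu
    return 1.0 - math.exp(-x / mu)

def F_inv(nu, p):             # its quantile function (pseudoinverse)
    return -nu * math.log(1.0 - p)

def H(nu, mu, x):             # maps an F_mu-distributed value to F_nu
    return F_inv(nu, F(mu, x))

for x in (0.1, 1.0, 7.5):
    assert abs(H(2.0, 5.0, x) - (2.0 / 5.0) * x) < 1e-9
```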

2.4 Variation

In classic PCA the mean square of the residuals is minimized. From a model-based perspective, this refers to minimizing the variance of the residuals. In our general approach, the existence of the variance is not guaranteed, so that we have to consider a general function that attaches a value to a residual. We wish to minimize the sum of these attached values, but face the additional difficulty that the calculation of the residuals would need additive inverses.

We call the function that attaches a value to a random variable a variation, in generalization of the quadratic variation of a Wiener process. Due to property (9) below, it might be interpreted as the number of underlying independent variables.

Definition 2.17.

Let F be a stable set of distributions. A continuous map v : R → [0, ∞) is called a variation, if the following conditions hold

 v(μ) = v(ν), if Fμ = Fν  (consistent)   (6)
 v(μ) > 0, μ ∈ R∖{0}  (positive)   (7)
 v(0R) = 0  (degenerate element)   (8)
 v(μ∘ν) = v(μ) + v(ν)  (additive).   (9)

We call a variation v scale invariant if, additionally,

 v(μν) = v(μ)v(ν), μ, ν ∈ R.   (10)

We also write v(X) and v(F) for v(μ), if X ∼ Fμ and F = Fμ.

Remark 2.18.
1. If v is scale invariant, we have v(μX) = v(μ)v(X), so that rescaling of all components with the same value will not change the outcome of a PCA, provided the sets B and I(X) in Definition 3.2 are also scale invariant.

2. The function v is negative definite, as is any function of the form μ ↦ |μ|α, α ∈ (0, 2]. Furthermore, exp(−v) is a semi-character on (R, ∘) with the identity as involution (Berg et al., 1984).

Proposition 2.19.

If v is scale invariant, then the following properties hold:

1. v(1R) = 1 for the neutral element 1R of (R, ⋅).

2. R is division free, i.e., for all μ, ν ∈ R with μν = 0R we have μ = 0R or ν = 0R.

3. Let R ⊂ ℝ be a non-trivial interval with standard topology. Then, v(μ) = |μ|α for some unique α > 0.

Proof.

1. Equality (10) yields v(1R) = v(1R)2. The positivity of the variation excludes v(1R) = 0. Hence, v(1R) = 1.

2. μν = 0R implies that v(μ)v(ν) = v(0R) = 0, so that v(μ) = 0 or v(ν) = 0. The positivity of the variation yields μ = 0R or ν = 0R.

3. The function t ↦ log v(et) is well defined on some nontrivial interval, is continuous there and, by the scale invariance, obeys Cauchy's functional equation; we get v(μ) = μα for μ > 0 and some α ∈ ℝ. For negative arguments, v(−1)2 = v(1R) = 1, so that v(−1) = 1 by positivity and hence v(μ) = |μ|α for all μ. Assume α ≤ 0. Then the continuity of the variation yields a contradiction to (9). Hence α > 0. Now, assume that v and ~v are two scale invariant variations with v(μ) = |μ|α and ~v(μ) = |μ|β. Then, for all μ ∈ R,

 (1 + |μ|α)1/α = |1R ∘ μ| = (1 + |μ|β)1/β,

so that α = β.

Example 2.20.

In the symmetric α-stable case, the so-called covariation norm ∥·∥α assigns the parameter σ to Sα(σ, 0, 0) for α ∈ (0, 2], i.e., ∥X∥α = σ. It follows immediately from the properties of ∥·∥α, see Samorodnitsky and Taqqu (1994), that v := ∥·∥αα satisfies the four properties of a scale invariant variation, that is, v(X) = σα. For centered Gaussian variables with α = 2, the variation indeed equals the variance.

Example 2.21.

In the case of the stable set of k-variate Gaussian distributions, see Example 2.11, the variation might be defined as

 v(A) := tr(AA⊤) = ∑i=1,…,k ∑j=1,…,k A2ij,  A ∈ R := ℝk×k.

Then, R is division free if and only if k = 1.

2.5 Semi-scalar product

Property (9) of the variation gives reason to generalize the notion “uncorrelated” to random variables without existing variance. The following definition is tailor-made for scale-invariant variations.

Definition 2.22.

Let Fd be a multivariate model, v a variation and ∔ the semigroup operation. Let X and Y be random vectors such that their distributions and that of X ∔ Y are in Fd. For d ≥ 1 let vd be an extension of the variation to random vectors. Then,

 ⟨X, Y⟩ := (vd(X ∔ Y) − vd(X) − vd(Y))/2

is called the semi-scalar product between X and Y. The vectors X and Y are called uncorrelated (positively / negatively correlated) if ⟨X, Y⟩ = 0 (⟨X, Y⟩ > 0 / ⟨X, Y⟩ < 0, respectively).

Remark 2.23.

Given the definition of a variation in the univariate case, the definition of the variation of a vector is not clear cut. A convenient definition is

 vd(X) := ∑i=1,…,d v(Xi),   (11)

as it ensures (6)–(9) without further assumptions.

Example 2.24.

Let Z be a standard Gaussian random variable and let the variation of a vector be the sum of the variations of the components. Then the random vectors X = (Z, Z)⊤ and Y = (Z, −Z)⊤ are jointly multivariate Gaussian, fully dependent, but uncorrelated according to Definition 2.22. Note that the standard notion of "uncorrelated" is defined only for scalar random variables. The generalized definition still implies that two jointly bivariate, scalar Gaussian random variables are uncorrelated if and only if they are independent.
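A back-of-the-envelope sketch of this example, assuming the semi-scalar product is (up to the factor 1/2) the excess variation vd(X ∔ Y) − vd(X) − vd(Y), with the variation of a Gaussian vector taken as the sum of the componentwise variances:

```python
# Example 2.24 by hand: X = (Z, Z), Y = (Z, -Z) for one standard Gaussian Z.
# Assumed convention: <X, Y> = (v(X + Y) - v(X) - v(Y)) / 2, with v(vector)
# the sum of the component variances. Sketch under these assumptions.
var_Z = 1.0

v_X = var_Z + var_Z          # v(X) = Var(Z) + Var(Z)   = 2
v_Y = var_Z + var_Z          # v(Y) = Var(Z) + Var(-Z)  = 2
# X + Y = (2Z, 0): Var(2Z) = 4 Var(Z), Var(0) = 0
v_sum = 4.0 * var_Z + 0.0

semi_scalar = 0.5 * (v_sum - v_X - v_Y)
# fully dependent vectors, yet uncorrelated in the generalized sense:
assert semi_scalar == 0.0
```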

Example 2.25.

In the max-stable case the operator ∔ is the maximum, so that X ∨ X = X and hence ⟨X, X⟩ = −vd(X)/2. In the case of α-stable distributions, however, the case α = 1 leads to ⟨X, X⟩ = 0, so that in particular the Cauchy distribution needs its own theoretical treatment or, at least, some limit considerations.

Remark 2.26.

Definition 2.22 suggests the interpretation that two random quantities are called uncorrelated if they behave as if they were independent. This behaviour has been made precise in terms of the variation.

Remark 2.27.

Linearity of the multivariate model is not sufficient to have the distribution of X ∔ Y in Fd in Definition 2.22. As a well-known example, consider the set of univariate centered Gaussian distributions. Let X be standard Gaussian and Y = SX, where S is a random sign independent of X, i.e., P(S = 1) = P(S = −1) = 1/2. Then, Y is standard Gaussian as well, but the distribution of X + Y does not belong to the set of Gaussian distributions.

Remark 2.28.

For two jointly α-stable, scalar random variables X and Y with scale parameters σX and σY, respectively, the codifference is defined as (Samorodnitsky and Taqqu, 1994)

 τ(X, Y) := σαX + σαY − σαX−Y

and measures the difference between two variables. By way of contrast, ⟨X, Y⟩ measures the difference in variation of a sum of two dependent variables and of two independent ones. Formally, 2⟨X, Y⟩ = v(X ∔ Y) − (v(X) + v(Y)).

Lemma 2.29.

Let Fd be a multivariate model, v a variation and vd given by (11). Let X, Y and Z be random vectors such that their distributions and those of X ∔ Y, X ∔ Z and (X ∔ Z) ∔ Y are in Fd. Let μ ∈ R. Then, the following assertions hold:

 ⟨X, X⟩ = (vd(X ∔ X) − 2vd(X))/2
 ⟨X, Y⟩ = ⟨Y, X⟩
 ⟨X, 0⟩ = 0
 ⟨X ∔ Z, Y⟩ = ⟨X, Z ∔ Y⟩ − ⟨X, Z⟩ + ⟨Z, Y⟩
 ⟨X, X⟩ = 0 ⇒ X ≡ 0,  if the variation is scale invariant
 ⟨μX, μY⟩ = v(μ)⟨X, Y⟩,  if the variation is scale invariant.

If X and Y are independent, then they are uncorrelated.

2.6 Semi-metric between random vectors

The regression problem from classic PCA as given in (22) below is formulated using the squared distance, which is not a norm, but precisely fits the setting of a semi-metric ρ that measures the gap between two random variables.

A semi-metric is given by the following three conditions

 ρ(X, Y) ≥ 0  (positivity),   (12)
 ρ(X, X) = 0  (identity),   (13)
 ρ(X, Y) = ρ(Y, X)  (symmetry).   (14)

With respect to the PCA we require further that ρ is continuous and

 ρ(.∑i=1,2 νiXi, .∑i=1,2 ξiXi) = ∑i=1,2 ρ(νiXi, ξiXi)  for νi, ξi ∈ R and X1, X2 independent,   (15)
 ρ(X, Y) = ρ(U, V)  for X ≡ U and Y ≡ V (a.s.).   (16)

Proposition 3.2 in Berg et al. (1984) deals with the generalization of a squared difference of real values towards complex values, in the framework of Hilbert spaces. The next definition carries over the implicit idea given there.

Definition 2.30.

Let Fd be a multivariate model, vd given by (11) and ⟨·,·⟩ the semi-scalar product. For random vectors X and Y such that their distributions and that of X ∔ Y are in Fd, let

Then ρ is called the associated semi-metric.

Lemma 2.31.

Let Fd be a multivariate model, v a variation with vd given by (11), and ρ be the associated semi-metric. Let X and Y be random vectors such that their distributions and that of X ∔ Y are in Fd. Then, the following assertions hold:

 ρ(X, X) = 0   (17)
 ρ(X, 0) =    (18)
 ρ(X, Y) =    (19)
 ρ(X, Y) =    (20)
 ρ(μX, νX) >    (21)

Furthermore, Equation (15) holds. Now assume that the variation of a vector is the sum of the variations of the components. Then, ρ is well-defined on Fd if the representation of X is unique up to reordering of the summands. If ∔ = + and G = ℝ, then ρ is well-defined if the representation is unique up to orthonormal transformations O, i.e., OO⊤ = 1.

Proof.

Equalities (17)-(19) obviously hold. Inequality (21) holds since implies and then

 = |μ|α2α(1+|ξ|α)−2(1+|ξ|α)2α−2

with . The right hand side takes its unique minimum at , which is due to (17). Now, let and . For any orthonormal matrix we have . Denote by the sum of the variation of all components of a matrix . Then,

Note that, by Maxwell’s theorem (Kallenberg, 2001, Proposition 12.2), holds for all orthonormal if and only if the are centered Gaussian.

3 Generalizing the classic PCA

In our model-based approach, the PCA is seen as an approximation of a random vector X with known distribution by some other random vector with a simpler structure. The function that maps a realization of X to a realization of the approximating vector is a projection in classic PCA. This function is called a reconstruction function here. We call a PCA inferable, if the existence and the knowledge of the reconstruction function are guaranteed. We start with reviewing the classic PCA.

3.1 Classic PCA and Autoencoders

Classic PCA is usually understood as reducing the complexity of data in an optimal way with respect to the mean squared error. In general, the data is assumed to be an i.i.d. sample of X. Classic PCA is based on the solution of (Pearson, 1901)

 minH : rank H ≤ p E∥X − HX∥22.   (22)

It can readily be seen that H = VpV⊤p is a solution to the minimization problem (22), where Vp is the matrix of the first p eigenvectors of the covariance matrix of X. In particular, H is a projection matrix, thus symmetric (Baldi and Hornik, 1989). In statistical literature, H is often replaced by VV⊤ in (22), additionally assuming that V is orthonormal. To enforce uniqueness of the solution in the general case, an ordering of the corresponding eigenvalues is further assumed.
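A hedged numerical sketch of the classic solution (synthetic data; Vp denotes the first p = 2 eigenvectors of the sample covariance):

```python
# Classic PCA as the solution of (22): H = V_p V_p^T with V_p the first p
# eigenvectors of Cov(X). Synthetic data, illustrative sketch only.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3)) @ np.array([[2.0, 0.0, 0.0],
                                           [0.5, 1.0, 0.0],
                                           [0.1, 0.1, 0.2]])
C = np.cov(X, rowvar=False)
w, V = np.linalg.eigh(C)               # eigenvalues in ascending order
Vp = V[:, ::-1][:, :2]                 # first p = 2 eigenvectors
H = Vp @ Vp.T                          # reconstruction map of (22)

# H is a symmetric projection matrix of rank p
assert np.allclose(H, H.T) and np.allclose(H @ H, H)
assert np.linalg.matrix_rank(H) == 2

def mse(M):
    R = X - X @ M.T
    return float((R * R).sum() / len(X))

# projecting onto the span of the two *smallest* eigenvectors does worse
alt = V[:, :2] @ V[:, :2].T
assert mse(H) <= mse(alt)
```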

This problem can be generalized as follows. Let L be a measurable loss function and Θ an arbitrary parameter space with elements θ = (θ1, θ2) used to parametrize two measurable functions fθ1 and gθ2. Then, the autoencoder problem is given as

 min(θ1,θ2)∈Θ E[L(X, fθ1 ∘ gθ2(X))].   (23)

Under mild assumptions, the existence of a solution is guaranteed.

Theorem 3.1.

Let X be a random variable and Θ a compact metric space for the parameter θ of the reconstruction functions rθ. Let L be a loss function that is bounded from below and continuous in each fixed argument. For all x, let the map θ ↦ rθ(x) be continuous. If for all θ ∈ Θ it holds that

 ∥L(X, rθ(X))∥L∞(Ω,A,P) := ess sup |L(X, rθ(X))| < ∞,

then a solution to the autoencoder regression problem

 minθ∈Θ E[L(X, rθ(X))]   (24)

exists.

Proof.

The function θ ↦ E[L(X, rθ(X))] is defined on a compact set by assumption, thus it suffices to show that it is continuous. For arbitrary θ ∈ Θ and any sequence (θn)n∈ℕ ⊂ Θ with limit θ, we get by dominated convergence

 limn→∞ E[L(X, rθn(X))] = E[limn→∞ L(X, rθn(X))] = E[L(X, rθ(X))].

This means that under reasonable choices of the statistical model and the loss function we always have a solution to the autoencoder problem. We will go further in our approach and also consider

 minθ∈Θ E[L(X, Yθ)]

for certain classes of random variables Yθ.

3.2 Generalized PCA

Since the Hilbert space structure is given up here, various generalizations of the classic PCA are thinkable. We give four variants, which we consider particularly interesting. Two notions directly correspond to variable selection procedures in linear regression analysis. The following definition is based on a general semi-metric, although we have the associated semi-metric in mind, since there is no proven evidence that our suggested semi-metric should be preferred. The following definition also allows that the PCA does not have a solution.

Definition 3.2.

Let Fd be a linear multivariate model and ρ be a semi-metric such that (12)-(16) hold. Let W = (W1, …, Wn) be given as in Definition 2.14. Let X ∼ Fdν,μ and p ≤ d. For b = (b1, …, bp) ∈ Rd×p, let Dn(b) denote the set of all ξ ∈ Rd×n whose columns lie in span(b1, …, bp). For some closed B ⊂ Rd×p and some subset I(X), the p-variate B-I PCA is defined as

 PCAp(X) = argminb∈B infξ∈Dn(b)∩I(X) ρ(X, ξW).   (25)

The PCA is called

1. exhaustive if B = Rd×p.

2. forward if

 B = {(b1, …, bp) ∈ Rd×p : (b1, …, bp−1) ∈ PCAp−1(X)}.   (26)

3. unrestricted if I(X) = Rd×n.

4. (linearly) inferable if

 I(X) = I(ν1, …, νn) ⊂ {(Hν1, …, Hνn) : H a (∔-linear) map Rd → Rd}.   (27)

A set of vectors that is a solution to the -variate PCA is called a set of first principal vectors for . Let and be corresponding sets of principal vectors. If and , then the set of vectors is called a set of first principal vectors for .

Remark 3.3.

Condition (26) ensures that the principal vectors in forward PCA are in decreasing order of importance. If these principal vectors are orthogonal in a certain sense, they might be called eigenvectors. For instance, two vectors μ and ν might be called orthogonal, if

 ⟨μX, νX⟩ = 0  for all F ∈ Fd and X ∼ F.   (28)

Note that this is in general stronger than requiring orthogonality for a single distribution.
In the Gaussian case, the vectors μ and ν are orthogonal in the sense of (28) if and only if they are orthogonal in the Euclidean sense. In the k-variate Gaussian case with k ≥ 2, two matrices A and B are orthogonal if and only if AB⊤ = 0, i.e., if the rows of A are all orthogonal to the rows of B.

Remark 3.4.

In some cases, it is sufficient to consider only n = 1 in the definition of a multivariate distribution, e.g., when adding independent variables is not reasonable. Then, Definition 3.2 still applies, if the operator ∔ and all conditions built on it are ignored.

Example 3.5.

(Continuation of the linear regression model, Example 2.12) Let P, Sℓ and Lℓ be defined as there, and let

 Sdℓ = {(A, …, A)⊤ ∈ Pd : A ∈ Sℓ},
 S = ⋃ℓ=m+1,…,k Sdℓ,
 B = Sp.

Then the exhaustive PCA searches the best subset selection with up to p predictor variables for a linear regression model with a d-variate dependent variable, k−1 predictor variables, and the error variable. The forward PCA performs the forward selection.
Let us now consider some underlying structure of the variable selection. Let and with and , otherwise. Then, both equalities and are not solvable for . We say that is not strictly preordered. Since Theorem A.7 of the Appendix is rather tight in its assumptions, which include strict preordering, we may expect that even the one-dimensional semimodule possesses subsemimodules with rank larger than one. This is indeed the case, as for , and . Then, has rank 2, for instance. Assume that two matrices are orthogonal in the sense of (28). Then, it follows that either one of the corresponding linear regression models (i.e., the whole first line of the matrix) is identically 0, or both linear models are deterministic, or both models are trivial. Hence, we will not be able to orthogonalize the vectors that span the subspaces of the exhaustive PCA and the forward PCA. Therefore, we may not expect that forward PCA and exhaustive PCA will be the same, cf. Theorem 3.10 below.

Example 3.6.

(Continuation of the linear regression model, Example 2.12) Other forms of variable selection are possible. For instance, let the subsemiring be given by the matrices

 A = ( Aσ          Aβ
       0(k−1)×1   AX ),

where AX is an arbitrary (k−1)×(k−1) matrix. For the PCA consider any matrix such that and the last lines of are all zero. Further, the matrix shall have the same rank as . This approach balances out a good fit of the dependent variable with a good fit of the predictor variables. Therefore, we might consider it as a "PCA variable selection". One extreme situation is that the variation puts nearly no weight on the dependent variable. Then, we end up primarily with a PCA for the predictor variables, in other words, with the PCA regression (Jolliffe, 2002). On the other hand, if the variation puts no weight on the predictor variables (condoning that Condition (7) is violated), one already obtains exactly the standard regression.

Remark 3.7.

For the linearly inferable PCA, matrices might be considered whose so-called Barvinok rank is at most p, i.e., the set I(X) in Definition 3.2 is given by means of all matrices of the form H1H⊤2 with H1, H2 ∈ Rd×p. Then, we may reformulate the exhaustive, linearly inferable PCA as

 PCAp(X) = argminH1,H2∈Rd×p ρ(X, H1H⊤2X).

Hence, the optimization problem becomes a single (2dp)-dimensional problem. A further advantage is that this choice follows closely the autoencoder idea. A disadvantage is that the Barvinok rank is rather restrictive, cf. Appendix A.2.

Remark 3.8.

If the variation is scale invariant and is the associated semi-metric, then the exhaustive, unrestricted PCA reads

with

That is,

Remark 3.9.

Except for the linearly inferable PCA, the requirement of the linearity of the multivariate model seems to be excessive, since only the univariate margins of X enter the associated semi-metric.

3.3 Coincidence of variants

Since the four variants of a generalized PCA coincide in the Gaussian case, we consider here general conditions for a coincidence in some exemplary cases.

Theorem 3.10.

Let the conditions of Definition 3.2 hold with the associated semi-metric and a scale invariant variation. Assume that for any subsemimodules and , vectors exist such that . Assume that for any subsemimodule and , a vector exists with the following two properties:

1. For all and , a value and a exist such that, for all that are orthogonal to in the sense of (28), we have

 ν ¨+ η ¨+ ξπ = ζ ¨+ η ¨+ θπ   (29)
 ν ¨+ η and θπ are orthogonal.   (30)

Then the unrestricted, forward PCA coincides with the unrestricted, exhaustive PCA.
If equality holds in (27) and H always has a representation of the form with and , then the linearly inferable, forward PCA coincides with the linearly inferable, unrestricted PCA.

Proof.

Condition (29) ensures that a sequence of principal vectors can be replaced by a sequence of pairwise orthogonal vectors , so that for all . Let