1 Introduction
Principal Component Analysis (PCA), introduced by Pearson (1901), has been one of the most commonly used statistical methods for reducing the complexity of datasets: for samples with a high number of features, a linear subspace is chosen in favor of a simpler representation of the data. Under the assumption of existing variances, these applications have been justified by many theoretical results, and the method has been applied successfully in a wide variety of scientific fields. Overviews can be found in
(Hastie, 2009; Jolliffe, 2002; Ringnér, 2008), for instance. Some special care is needed when the distribution of the data deviates considerably from the multivariate normal distribution. In particular, probability laws without second moments or counting distributions may lack theoretical justification or a good interpretation when PCA is applied
(Jolliffe, 2002). Several generalizations have been considered (Bengio et al., 2013; Candès et al., 2011; Hofmann et al., 2008; Silverman, 1996; Vidal et al., 2016)
. A particularly flexible one is the autoencoder, a neural network architecture that has been considered both in mathematics and statistical learning
(Baldi and Hornik, 1989; Jolliffe, 2002; Oja and Karhunen, 1985). The autoencoder is based on the formulation of a regression problem, where the data is reconstructed by a map that simplifies the structure in some sense. Often, simplification is meant in terms of mapping into a lower dimensional space and then back to the original space of the data. Here, the admissible maps are possibly nonlinear and chosen such that a certain loss function is minimized. Classic PCA appears as a special case of choosing linear maps.
Algebra plays a dominant role in founding certain areas of probability theory, in particular stochastic processes
(Sasvári, 2005; Schilling et al., 2012; Strokorb and Schlather, 2015). When special problems are addressed, modern algebra has also found applications in statistics, for instance, in design of experiments (Bailey, 2004) or in learning theory (Watanabe, 2009). Surprisingly, when data themselves are considered, ‘linearity’ usually refers to vector spaces over the field of real numbers, although many random variables exhibit natural linearity with respect to other operations
(Golan, 1999; P. Prakash, 1974). An exception in extreme value theory is Gissibl et al. (2021), who refer to the tropical algebra, however in a different context than PCA. Distributions with a stability to some given algebraic operation typically stem from limit laws and hence are infinitely divisible, again not necessarily with respect to the field of real numbers (Davydov et al., 2008). In our algebraic approach, the classic PCA reappears as the Gaussian case. A particular class of distributions without guaranteed second moments, which exhibits linearity with respect to maxima instead of addition, are the max-stable distributions (L. de Haan, 2006; Resnick, 1987; Stoev and Taqqu, 2005)
. In contrast to the multivariate Gaussian distribution, the dependence structure of max-stable distributions cannot be fully described by bivariate characteristics like the covariance
(Beirlant et al., 2004; Strokorb and Schlather, 2015). Hence, decomposing any such derived matrix cannot be sufficient, at least from a theoretical point of view (Jiang et al., 2020). We also consider intrinsically vector-valued data, which appear in colour coding, for instance. Special cases thereof are matrix-valued data, which appear in a single measurement, for example in functional magnetic resonance imaging (Wang et al., 2016)
. Linear regression models can also be seen as a special case of vectorvalued data. Both variable selection and classic PCA are dimension reducing methods. The ideas are frequently combined, leading to the PCA regression analysis or the sparse PCA, for instance
(Hastie, 2009). Nonetheless, variable selection and PCA have been considered as different methods. The central part of the paper is the definition of the generalized PCA in Section 3. It is based on generalizations of several well-known notions, such as stable distributions and quadratic variation (Section 2). An important specification of our approach is the PCA for extreme values (Section 4). Some background information is given in the appendix.
2 Foundations
Since classic PCA minimizes the mean square of the residuals (Pearson, 1901), calculating the difference between random variables is implicitly required. In our generalization towards extreme values with Fréchet margins, we replace the abelian group by the semigroup where . Since the calculation of a difference is impossible in a semigroup context, we provide a workaround for the mean square of the residuals here. First, we have to declare for which random vectors we have a workaround (Section 2.3). Essentially, these vectors have a stable distribution (Section 2.2). In Subsection 2.6, we define a convenient distance between random vectors, which avoids the calculation of residuals. This semimetric is based on a semiscalar product (Subsection 2.5), which itself is based on a kind of valuation principle (Subsection 2.4). The latter is fundamental, since it (i) generalizes the quadratic variation, (ii) is unique in important cases and (iii) throws new light on the variance.
2.1 Semigroups and Semirings
Since semigroups are not that frequently used in a statistical context, we recall basic notions. See Golan (1999) for a general introduction, for instance. Throughout the paper, we will use , , , and for the binary operator of the standard addition, the maximum, a general semigroup and a general semiring, respectively. The corresponding multioperators are denoted by , , and .
Definition 2.1.
Let be a nonempty set and be an associative operation, then the tuple is called a semigroup. A semigroup is called

a monoid with identity element , if an element exists, such that for all .

commutative, if for all .

topological, if the set has a topology , is a topological space and the map is continuous.
Definition 2.2.
A set with addition and multiplication is called a semiring if

is a commutative monoid with identity element ,

is a monoid with identity element ,

multiplication is left and right distributive, i.e., and

for all .
Example 2.3.
Examples of practically relevant semirings are, for instance, with , , , and , where denotes the matrix multiplication. If is a nonempty set, then is also a semiring. Last but not least, the quaternion number system is a semiring, which is used in certain areas of physics, see Menanno and Mazzotti (2012), for instance.
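As a minimal numerical sketch of Definition 2.2, a semiring can be encoded as a pair of binary operations with their neutral elements. The `Semiring` container and the names `plus_times` and `max_times` below are our own illustrative choices, not part of any library; the max-times semiring on the non-negative reals is the one relevant for Fréchet margins later on.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")

@dataclass(frozen=True)
class Semiring:
    """A semiring (S, +, *, 0, 1): two monoids linked by distributivity,
    with no additive or multiplicative inverses required."""
    add: Callable[[T, T], T]
    mul: Callable[[T, T], T]
    zero: T
    one: T

# Standard plus-times semiring on the non-negative reals.
plus_times = Semiring(add=lambda a, b: a + b, mul=lambda a, b: a * b, zero=0.0, one=1.0)

# Max-times semiring: 'addition' is the maximum, 'multiplication' the product.
max_times = Semiring(add=max, mul=lambda a, b: a * b, zero=0.0, one=1.0)

# Distributivity holds in both: a * (b + c) = a*b + a*c in the semiring sense.
a, b, c = 2.0, 3.0, 5.0
assert max_times.mul(a, max_times.add(b, c)) == max_times.add(
    max_times.mul(a, b), max_times.mul(a, c)
)
```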
Essentially, the definition of a semiring means that the inverses with respect to addition and multiplication are missing and that the multiplication is not necessarily commutative. However, we will always assume that is commutative, and have at least two elements, and that and are topological. Since the focus of the paper is on algebraic aspects, we assume for ease that both and are Polish. Further, we will drop the multiplication sign in formulae whenever possible. On the other hand, especially when several different semigroups or semirings are involved in a formula, we may clarify neutral elements and operators with indices.
In our setup, the “scalar” random variable takes values in a monoid . Much more structure will be imposed on the set , which indexes the distributions and which will be a semiring. Primarily, it is this index set that is extended to higher dimensions, the so-called semimodule.
Definition 2.4.
Let be a commutative semigroup and a semiring. Let be a mapping that satisfies for all and the following properties:
Then is called a semimodule over . A subset that obeys the above conditions is called a subsemimodule. If is a topological semigroup and is continuous, the semimodule is called topological. We write and instead of , if is canonical, e.g., if for some .
Definition 2.5.
Let be a semimodule and . Let be the cardinality of . The value
is called the rank of .
Note that the span in the preceding definition is calculated according to the semimodule operations. Linear maps will play an essential role for the reconstruction of the points.
Definition 2.6.
Let and be topological semimodules over the same semiring , with the commutative semigroups and . A map satisfying the conditions
is called linear. If and are both canonical, then is called linear.
Remark 2.7.
If and , a linear map can always be represented by a matrix such that
Although the definitions above are in analogy to the definitions of a vector space and a linear mapping, the consequences of the transition from groups to semigroups are severe. For instance, the dimension of a subspace of a finite dimensional space does not necessarily exist. Appendix A gives some implications that are particularly important when dealing with extreme values. Appendix A also implicitly delivers arguments why a constructive approach via explicit multivariate distributions is chosen to define the generalized PCA, and not via an abstract formulation based on subsemimodules or on the rank of a matrix.
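The matrix representation of Remark 2.7 can be made concrete: a linear map over a semiring acts as a matrix-vector product in which the usual sum is replaced by the semiring addition. The helper `semiring_matvec` below is a hypothetical sketch of ours, with the max-times product as default.

```python
import numpy as np

def semiring_matvec(A, x, add=np.maximum, mul=np.multiply, zero=0.0):
    """Matrix-vector product over a semiring: entry i is the semiring sum
    over j of a_ij (semiring-)times x_j. The defaults give the max-times
    product; add=np.add recovers the usual matrix-vector product."""
    m, n = A.shape
    out = np.full(m, zero)
    for j in range(n):
        out = add(out, mul(A[:, j], x[j]))
    return out

A = np.array([[1.0, 2.0], [0.5, 3.0]])
x = np.array([4.0, 1.0])

semiring_matvec(A, x)              # max-times: array([4., 3.])
semiring_matvec(A, x, add=np.add)  # standard:  array([6., 5.])
```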
2.2 Stable distributions
For a general approach to PCA without existing variance we need a generalized notion of stable distributions, where we replace the standard addition by an arbitrary semigroup operation. Some additional care is needed with respect to the scaling properties of random variables. The following definitions provide the structure to develop a useful theory.
Definition 2.8.
Let be a topological monoid and a semiring with an additional binary operation . Let be a set of distributions over and , , measurable maps, such that
(1)  
(2) 
Then, the set is called a stable set of distributions. We write briefly instead of .
The following definition ensures that transformations of random vectors still have the required distribution, see Proposition 2.15 below.
Definition 2.9.
Let be a stable set of distributions where all , , are linear; then is called linear.
Example 2.10.
In the case of symmetric stable distributions , , we have for that
Hence, the set of symmetric stable distributions is stable and linear. Here, the Gaussian case is included as , see Samorodnitsky and Taqqu (1994).
Example 2.11.
Matrix-valued data can be considered as vector-valued with special constraints, which are modelled as a subsemiring of the semiring of matrices. An example of such a subsemiring is a set of block diagonal matrices with fixed block structure. Let us consider here vector-valued data with values in , . Let
where is the standard matrix multiplication. Then is an abelian group and is a non-commutative ring. We identify with the set of linear maps, i.e.
so that
is the identity matrix, for instance. The
, , are not distinct, since for any we have. Finally, denote by a square root (uniquely defined through some alphabetic ordering) of a positive semidefinite matrix . Then, for and two independent random matrices we have
Thus, the set of variate Gaussian distributions is stable and linear.
Example 2.12.
Let for some as in the previous example. We switch to the standard notation. Let be the subsemiring of matrices of the form
where , , , denotes the identity matrix, and shall imply and . Let for
Then, for , the sets and are subsemirings of . We interpret this setup as a framework for linear regression models. Let
be the predictor variables which are typically assumed to be independent of the error
. Since we aim to show later on that variable selection in linear modelling is a special case of our PCA, we assume that has any multivariate Gaussian distribution. Let . Then, equals the standard deviation of the error if
. The first component of equals the dependent variable , i.e., (3)
The set corresponds to a simple linear regression model based on , for . The set , , denotes the models where only the predictor variables , , are considered. In our example, the intercept, which is crucial in practice, is missing for ease of theoretical reasoning. The family of distributions corresponding to the , , is a stable set if consists of independent Gaussian components. That is, it can be shown that one of the roots is in if and are. Unfortunately, this is a trivial case for variable selection. For a general distribution of , a representation of the linear model as a stable family is unknown. Fortunately, itself is rich enough so that the PCA can be applied, cf. Example 3.5.
Definition 2.13.
Let be a linear, stable set of distributions. If allows the division by in the sense that
for some and any , then is called a set of infinitely divisible distributions.
2.3 Multivariate distributions
Stable sets of multivariate distributions are already covered by Definition 2.8. Here, we consider an alternative, constructive definition for a multivariate version of a stable distribution that is tailored for our generalized PCA and avoids existence problems. Recall that is an abbreviation for .
Definition 2.14.
Let be a stable set of univariate distributions and . Let be the set of distributions given by
(4) 
for all
independent random variables , , and . We write
with . Let be the weak closure of . Then, an element of is called a variate distribution.
We call a multivariate model. It is called linear if is linear.
Definition 2.14 ensures that the univariate margins of are in . A linear multivariate model ensures that the variate margins of are in .
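In the max-stable specification with Fréchet margins, the construction of Definition 2.14 can be checked numerically: the distribution function of a max-linear combination of independent Fréchet variables factorizes over the coefficients and is again Fréchet. The sketch below uses our own function names and standard α-Fréchet margins as an assumed example.

```python
import math

def frechet_cdf(x, scale=1.0, alpha=1.0):
    """CDF of the Fréchet distribution: exp(-(scale/x)**alpha) for x > 0."""
    return math.exp(-((scale / x) ** alpha)) if x > 0 else 0.0

def max_linear_cdf(x, coeffs, alpha=1.0):
    """CDF of max_j a_j * Z_j for independent standard alpha-Fréchet Z_j:
    by independence, the CDF of the maximum is the product of the marginal CDFs."""
    p = 1.0
    for a in coeffs:
        if a > 0:
            p *= frechet_cdf(x, scale=a, alpha=alpha)
    return p

# Max-stability: the max-linear combination is again Fréchet, with scale
# (sum_j a_j**alpha)**(1/alpha); this additivity on the level of scales
# replaces the additivity of variances in the Gaussian case.
coeffs, alpha = [2.0, 3.0], 1.0
total = sum(a ** alpha for a in coeffs) ** (1.0 / alpha)  # = 5.0
for x in (0.5, 1.0, 4.0, 10.0):
    assert math.isclose(max_linear_cdf(x, coeffs, alpha),
                        frechet_cdf(x, scale=total, alpha=alpha))
```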
Proposition 2.15.
Let be a linear multivariate model, , , and
Then
(5) 
Proof.
Remark 2.16.
Both Definition 2.8 and Definition 2.14 can be generalized slightly, replacing by , which is then applied to random vectors , only. Then, Equation (4) is rewritten as
Assume and all are continuous and distinct, i.e., for . Then is a possible choice. Here, denotes the pseudoinverse of . In practice, monotonously increasing maps are preferred, so that the are essentially unique. Hence, the generalized map suggests that our approach so far is essentially restricted to continuous distributions. The set of Gamma distributions with fixed scale parameter and arbitrary nonnegative shape parameters obeys this generalized framework, but fails to be a stable set.

2.4 Variation
In classic PCA the mean square of the residuals is minimized. From a model-based perspective, this refers to minimizing the variance of the residuals. In our general approach, the existence of the variance is not guaranteed, so that we have to consider a general function that attaches a value to a residual. We wish to minimize the sum of these attached values, but face the additional difficulty that the calculation of the residuals would need additive inverses.
We call the function that attaches values to a random variable a variation, in generalization of the quadratic variation of a Wiener process. Due to property (9) below, it might be interpreted as the number of underlying independent variables.
Definition 2.17.
Let be a stable set of distributions. A continuous map is called a variation, if the following conditions hold
(consistent)  (6)  
(positive)  (7)  
(degenerate element)  (8)  
(additive).  (9) 
We call a variation scale invariant if, additionally,
(10) 
We also write and for and .
Remark 2.18.

If is scale invariant, we have , so that rescaling of all components with the same value will not change the outcome of a PCA, provided the sets and in Definition 3.2 are also scale invariant.

The function is negative definite, as is any function of the form . Furthermore, is a semicharacter on with the identity as involution (Berg et al., 1984).
Proposition 2.19.
If is scale invariant, then the following properties hold:

for the neutral element of .

is division free, i.e., for all with we have or .

Let be a nontrivial interval with standard topology. Then, for some unique .
Proof.

Equality (10) yields . The positivity of the variation excludes . Hence, .

implies that or . The positivity of the variation yields or .

Let . The function , is well defined on some nontrivial interval that includes and is continuous there. Since obeys Cauchy’s functional equation we get for and some . Now, assume that . Then, Cauchy’s functional equation delivers that for and some . For we have , so that . Hence . The additivity yields with . Assume . Then the continuity of the variation yields in contradiction to (9). Now, assume that and are two scale invariant variations with . Then, for all ,
so that .
Example 2.20.
In the symmetric stable case, the so-called covariation norm assigns the parameter to for , i.e., . It follows immediately from the properties of , see Samorodnitsky and Taqqu (1994), that satisfies the four properties of a scale invariant variation, that is, . For centered Gaussian variables with , the variation indeed equals the variance.
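Example 2.20 can be illustrated with a small numeric check: for symmetric α-stable variables, the α-th power of the scale parameter is additive over independent sums, which is exactly property (9), and for α = 2 it reduces to the variance. The helper names below are hypothetical sketches of ours, relying only on the standard scaling rule for sums of independent symmetric stable variables.

```python
def stable_scale_of_sum(sigmas, alpha):
    """Scale parameter of a sum of independent symmetric alpha-stable variables
    with scale parameters sigmas: (sum_i sigma_i**alpha)**(1/alpha)."""
    return sum(s ** alpha for s in sigmas) ** (1.0 / alpha)

def variation(sigma, alpha):
    """The covariation-norm based variation of Example 2.20: the alpha-th power
    of the scale parameter; for alpha = 2 (centered Gaussian, sigma the standard
    deviation) this is exactly the variance."""
    return sigma ** alpha

# Additivity (property (9)): the variation of an independent sum is the sum of
# the variations, even for alpha < 2, where no variance exists.
alpha = 1.5
s = stable_scale_of_sum([2.0, 3.0], alpha)
assert abs(variation(s, alpha) - (variation(2.0, alpha) + variation(3.0, alpha))) < 1e-12
```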
Example 2.21.
In the case of the stable set of variate Gaussian distributions, see Example 2.11, the variation might be defined as
Then, is division free if and only if .
2.5 Semiscalar product
Property (9) of the variation gives reason to generalize the notion “uncorrelated” to random variables without existing variance. The following definition is tailor-made for scale invariant variations.
Definition 2.22.
Let be a multivariate model, a variation and . Let and be random vectors such that their distribution and that of are in . For let be an extension of the variation to a vector . Then,
is called the semiscalar product between and . The vectors and are called uncorrelated (positively / negatively correlated) if ( respectively ).
Remark 2.23.
Example 2.24.
Let be a standard Gaussian random variable and the variation of a vector be the sum of the variation of the components. Then the random vectors and are jointly multivariate Gaussian, fully dependent, but uncorrelated according to Definition 2.22. Note that the standard notion of “uncorrelated” is defined only for scalar random variables. The generalized definition still implies that two jointly bivariate, scalar Gaussian random variables are uncorrelated if and only if they are independent.
Example 2.25.
In the max-stable case the operator is the maximum, so that and hence . In the case of stable distributions, however, the case leads to
, so that in particular the Cauchy distribution needs its own theoretical treatment or, at least, some limit considerations.
Remark 2.26.
Definition 2.22 suggests the interpretation that two random quantities are called uncorrelated if they behave as if they were independent. This behaviour has been made precise in terms of the variation.
Remark 2.27.
Linearity of the multivariate model is not sufficient to have in Definition 2.22. As a well-known example, consider the set of univariate Gaussian distributions with , . Let and where is a random sign, i.e., . Then, , but the distribution of does not belong to .
Remark 2.28.
For two jointly stable, scalar random variables and with scale parameter and , respectively, the codifference is defined as (Samorodnitsky and Taqqu, 1994)
and measures the difference between two variables. By way of contrast, measures the difference in variation of a sum of two dependent variables and of two independent ones. Formally, .
Lemma 2.29.
Let be a multivariate model, a variation and . Let , and be random vectors such that their distributions and those of , and are in . Let . Then, the following assertions hold:
If and are independent, then they are uncorrelated.
2.6 Semimetric between random vectors
The regression problem from classic PCA as given in (22) below is formulated using the squared distance, which is not a norm but precisely fits the setting of a semimetric, which measures the gap between two random variables .
A semimetric is given by the following three conditions
(12)  
(13)  
(14) 
With respect to the PCA we require further that is continuous and
(15)  
(16) 
Proposition 3.2 in Berg et al. (1984) deals with the generalization of a squared difference of real values towards complex values, in the framework of Hilbert spaces. The next definition carries over the implicit idea given there.
Definition 2.30.
Let be a multivariate model, given by (11), and . For random vectors and such that their distributions and that of are in , let
Then is called the associated semimetric.
Lemma 2.31.
Let be a multivariate model, a variation with , and be the associated semimetric. Let and be random vectors such that their distributions and that of are in . Then, the following assertions hold:
(17)  
(18)  
(19)  
(20)  
(21) 
Furthermore, Equation (15) holds. Now assume that the variation of a vector is the sum of the variations of the components. Then, is well-defined on if the representation of is unique up to reordering of the summands. If and then is well-defined if the representation is unique up to orthonormal transformations with i.e., .
Proof.
Equalities (17)–(19) obviously hold. Inequality (21) holds since implies and then
with . The right hand side takes its unique minimum at , which is due to (17). Now, let and . For any orthonormal matrix we have . Denote by the sum of the variation of all components of a matrix . Then,
Note that, by Maxwell’s theorem (Kallenberg, 2001, Proposition 12.2), holds for all orthonormal if and only if the are centered Gaussian.
3 Generalizing the classic PCA
In our model-based approach, the PCA is seen as an approximation of a random vector with known distribution by some other random vector with a simpler structure. The function that maps a realization of to a realization of is a projection in classic PCA. This function is called a reconstruction function here. We call a PCA inferable if the existence and the knowledge of the reconstruction function are guaranteed. We start by reviewing the classic PCA.
3.1 Classic PCA and Autoencoders
Classic PCA is usually understood as reducing the complexity of data in an optimal way with respect to the mean squared error. In general, the data is assumed to be an i.i.d. sample of . Classic PCA is based on the solution of (Pearson, 1901)
(22) 
It can readily be seen that is a solution to the minimization problem (22), where is the matrix of the first eigenvectors. In particular, is a projection matrix and thus symmetric (Baldi and Hornik, 1989). In the statistical literature, is often replaced by in (22), additionally assuming that is orthonormal. To enforce uniqueness of the solution in the general case, an ordering of the corresponding eigenvalues is further assumed.
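The eigendecomposition solution to (22) can be verified numerically. The following sketch assumes centered data and the empirical covariance; it checks that the projection onto the first k eigenvectors is symmetric and idempotent, and that its mean squared residual equals the sum of the discarded eigenvalues, which is the minimal value of the objective.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # i.i.d. sample
Xc = X - X.mean(axis=0)                    # center, as in the classic formulation

# Eigendecomposition of the empirical covariance, eigenvalues in decreasing order.
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
V_k = eigvecs[:, order[:2]]                # matrix of the first k = 2 eigenvectors

P = V_k @ V_k.T                            # projection matrix, symmetric and idempotent
assert np.allclose(P, P.T) and np.allclose(P @ P, P)

# The mean squared residual of the rank-k reconstruction equals the sum of the
# discarded eigenvalues, i.e., the minimal value of the objective in (22).
mse = np.mean(np.sum((Xc - Xc @ P) ** 2, axis=1))
assert np.isclose(mse, eigvals[order[2:]].sum())
```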
This problem can be generalized as follows. Let be a measurable loss function and an arbitrary parameter space with elements used to parametrize two measurable functions and . Then, the autoencoder problem is given as
(23) 
Under mild assumptions, the existence of a solution is guaranteed.
Theorem 3.1.
Let be a random variable and a compact metric space for the parameter of the reconstruction functions . Let be a loss function that is bounded from below by and continuous in one argument when the other is fixed. For all , let the map be continuous. If for all it holds that
then a solution to the autoencoder regression problem
(24) 
exists.
Proof.
The function has, by assumption, a compact preimage; thus it suffices to show that it is continuous. For arbitrary and any sequence with limit , we get by dominated convergence
This means that under reasonable choices of the statistical model and loss function, we always have a solution to the autoencoder problem. We will go further in our approach and consider also
for certain classes of random variables .
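A minimal instance of the autoencoder problem under the assumptions of Theorem 3.1 can be sketched as follows: with the compact parameter space [0, π] for a one-dimensional linear encoder/decoder pair and the squared loss, the continuous loss attains its minimum, and the optimal direction recovers the first principal component. All names and the grid-search approximation are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((300, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])  # anisotropic data

def loss(theta):
    """Squared reconstruction error of the rank-1 autoencoder with direction theta:
    f encodes x to the scalar x.v, g decodes it back to (x.v) v."""
    v = np.array([np.cos(theta), np.sin(theta)])
    scores = X @ v
    return np.mean(np.sum((X - np.outer(scores, v)) ** 2, axis=1))

# Compact parameter space [0, pi]: the continuous loss attains its minimum
# (Theorem 3.1); a dense grid search approximates the minimizer.
thetas = np.linspace(0.0, np.pi, 2000)
best = thetas[np.argmin([loss(t) for t in thetas])]

# The optimal direction agrees (up to sign) with the first eigenvector of the
# empirical second-moment matrix: the linear autoencoder recovers classic PCA.
eigvals, eigvecs = np.linalg.eigh(X.T @ X / len(X))
v_pca = eigvecs[:, np.argmax(eigvals)]
v_best = np.array([np.cos(best), np.sin(best)])
assert abs(abs(v_best @ v_pca) - 1.0) < 1e-3
```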
3.2 Generalized PCA
Since the Hilbert space structure is given up here, various generalizations of the classic PCA are conceivable. We present four variants that we consider particularly interesting. Two notions correspond directly to variable selection procedures in linear regression analysis. The following definition is based on a general semimetric, although we have the associated semimetric in mind, since there is no proven evidence that our suggested semimetric should be preferred. The following definition allows that the PCA does not have a solution.
Definition 3.2.
Let be a linear multivariate model and be a semimetric such that (12)(16) hold. Let be given as in Definition 2.14. Let , and for . For some closed and some subset , the variate BI PCA is defined as
(25) 
The PCA is called

exhaustive if .

forward if
(26) 
unrestricted if .

(linearly) inferable if
(27)
A set of vectors that is a solution to the variate PCA is called a set of first principal vectors for . Let and be corresponding sets of principal vectors. If and , then the set of vectors is called a set of first principal vectors for .
Remark 3.3.
Condition (26) ensures that the principal vectors in forward PCA are in decreasing order of importance. If these principal vectors are orthogonal in a certain sense, they might be called eigenvectors. For instance, two vectors might be called orthogonal, if
(28) 
Note that this is in general stronger than requiring . In the Gaussian case, the vectors and are orthogonal in the sense of (28) if and only if they are orthogonal in the Euclidean sense. In the variate Gaussian case with , two matrices are orthogonal if and only if , i.e., if the rows of are all orthogonal to the rows of .
Remark 3.4.
In some cases, it is sufficient to consider only in the definition of a multivariate distribution, e.g., when adding independent variables is not reasonable. Then, Definition 3.2 still applies if the operator and all conditions built on it are ignored.
Example 3.5.
(Continuation of the linear regression model, Example 2.12) Let , and be defined as there, , , , and
Then the exhaustive PCA performs best subset selection with up to predictor variables for a linear regression model with a variate dependent variable, predictor variables, and error variables. The forward PCA performs the forward selection. Let us now consider some underlying structure of the variable selection. Let and with and , otherwise. Then, both equalities and are not solvable for . We say that is not strictly preordered. Since Theorem A.7 of the Appendix is rather tight in its assumptions, which include strict preordering, we may expect that even the one-dimensional semimodule possesses subsemimodules with rank larger than . This is indeed the case, as for , and . Then, has rank 2, for instance. Assume that two matrices are orthogonal in the sense of (28). Then, it follows that either one of the corresponding linear regression models (i.e., the whole first row of the matrix) is identically , or both linear models are deterministic, i.e., , or both models are trivial, i.e., . Hence, we will not be able to orthogonalize the vectors that span the subspaces of the exhaustive PCA and the forward PCA. Therefore, we may not expect that the forward PCA and the exhaustive PCA coincide, cf. Theorem 3.10 below.
Example 3.6.
(Continuation of the linear regression model, Example 2.12) Other forms of variable selection are possible. For instance, let the subsemiring be given by the matrices
where is any matrix. For the consider any matrix such that and the last rows of are all zero. Further, the matrix shall have the same rank as . This approach balances a good fit of the dependent variable against a good fit of the predictor variables. Therefore, we might consider it a “PCA variable selection”. One extreme situation is that the variation puts nearly no weight on the dependent variable. Then, we end up primarily with a PCA for the predictor variables, in other words, with the PCA regression (Jolliffe, 2002). On the other hand, if the variation puts no weight on the predictor variables (accepting that Condition (7) is violated), already delivers exactly the standard regression.
Remark 3.7.
For the linearly inferable PCA, matrices might be considered whose so-called Barvinok rank is at most , i.e., the set in Definition 3.2 is given by means of all matrices of the form with . Then, we may reformulate the exhaustive, linearly inferable PCA as
Hence, the optimization problem becomes a single dimensional problem. A further advantage is that this choice follows closely the autoencoder idea. A disadvantage is that the Barvinok rank is rather restrictive, cf. Appendix A.2.
Remark 3.8.
If the variation is scale invariant and is the associated semimetric, then the exhaustive, unrestricted PCA reads
with
That is,
Remark 3.9.
Except for the linearly inferable PCA, the requirement of the linearity of the multivariate model seems excessive, since only the univariate margins of enter the associated semimetric.
3.3 Coincidence of variants
Since the four variants of a generalized PCA coincide in the Gaussian case, we consider here general conditions for a coincidence in some exemplary cases.
Theorem 3.10.
Let the conditions of Definition 3.2 hold with the associated semimetric and scale invariant variation. Assume that for any subsemimodules and , vectors exist such that . Assume that for any subsemimodule and a vector exists with the following two properties:


For all and a value and a exist such that, for all that are orthogonal to in the sense of (28), we have
(29) are (30)
Then the unrestricted, forward PCA coincides with the unrestricted, exhaustive PCA. If equality holds in (27) and always has a representation of the form with and , then the linearly inferable, forward PCA coincides with the linearly inferable, unrestricted PCA.
Proof.
Condition (29) ensures that a sequence of principal vectors can be replaced by a sequence of pairwise orthogonal vectors , so that for all . Let