In the classical dimension reduction problem we are given a list of points sampled according to some unknown distribution and the goal is to find a low-dimensional (-dimensional where ) affine subspace , so that the projection of the points onto preserves key structural information (the latter term makes sense only after we add some assumptions on the process that generated our points, i.e. on
). The most popular method for solving the problem is Principal Component Analysis (PCA) [Fisher(1922)]. In PCA we first assume that
is a multivariate Gaussian distribution, then we estimate its expectation and covariance matrix from data and calculate an affine subspace from the latter. Although PCA is commonly used for highly “non-Gaussian” data, it is important to note that PCA rests on the “normality assumption”. Alternative methods, e.g. sparse PCA [Zou et al.(2004)], are akin to classical PCA in the sense that they assume that a subspace can be found based only on the covariance matrix of the data.
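As a point of reference, the classical PCA recipe described above (estimate the mean and covariance, then take the affine subspace spanned by the leading eigenvectors) can be sketched as follows. This is a minimal NumPy illustration and not part of the method proposed in this paper; the function names `pca_affine_subspace` and `project`, the argument names, and the toy data are assumptions made for the example.

```python
import numpy as np

def pca_affine_subspace(X, k):
    """Return (mean, basis) describing the k-dimensional PCA affine subspace.

    X     : array of shape (N, n), one sample per row.
    basis : array of shape (k, n) whose rows are orthonormal directions.
    """
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)      # (n, n) sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest
    basis = eigvecs[:, order].T
    return mean, basis

def project(X, mean, basis):
    """Orthogonally project the rows of X onto mean + span(rows of basis)."""
    return mean + (X - mean) @ basis.T @ basis

# Toy data (hypothetical): points close to a 2-dimensional plane in R^5.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 5)) + 3.0

mean, basis = pca_affine_subspace(X, k=2)
X_proj = project(X, mean, basis)
# Relative part of the centered data that the projection does not capture.
residual = np.linalg.norm(X - X_proj) / np.linalg.norm(X - mean)
```

Only the first two moments of the sample enter this computation, which is the sense in which PCA and its variants depend on the covariance matrix alone.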
The approach that we present in this paper is based on the theory of generalized functions, or tempered distributions [Soboleff(1936), Schwartz(1949)]. An important generalized function that cannot be represented as an ordinary function is the Dirac delta function, denoted , and denotes its -dimensional version.
Since the assumptions on can vary, the most natural first step is to define the empirical probability density function, i.e. . With this definition, dimension reduction can be understood as an approximation: , where is a generalized function (that we need to find) whose density is supported in a -dimensional affine subspace . Note that a function whose density is supported in some low-dimensional subset of is not an ordinary function. As is usually done, we can assume that our initial data have already been centered, so that instead of searching for a function supported in an affine subspace, we assume that is a linear subspace (i.e. ). If is the basis of , then in “physics notation” where is an ordinary function. To obtain an optimization formulation it remains to add that we are given a loss that measures the distance between our ground truth and the distribution that we search for. In the experimental part of the paper we consider where is a smoothing of a generalized function via convolution with some function . Thus, in our approach the dimension reduction problem is defined as the optimization problem:
Another well-known problem of data analysis is the so-called sufficient dimension reduction problem (sometimes called the supervised dimension reduction), which is tightly connected with the latter problem. There we are given a finite number of pairs
, also generated according to some unknown joint distribution and for we assume that (where ):
where is the Gaussian noise and
are unknown vectors and is an unknown smooth function. The latter implies that an output is conditionally independent of , given . Equivalently, the conditional distribution is the same as (and normal). Our goal is to recover the vectors and the function .
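To make the model concrete, the following sketch draws a synthetic sample in which the output depends on the input only through a few linear projections, plus additive Gaussian noise. All names and values here (`W`, `g`, `sigma`, the dimensions, and the particular link function) are hypothetical choices for illustration; the text above does not fix them.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, N = 6, 2, 1000      # input dimension, number of projections, sample size
sigma = 0.1               # standard deviation of the Gaussian noise (assumed)

# Rows of W play the role of the unknown vectors (hypothetical values).
W = rng.normal(size=(k, n))

def g(z):
    """A hypothetical smooth link function of k variables; z has shape (N, k)."""
    return np.sin(z[:, 0]) + z[:, 1] ** 2

X = rng.normal(size=(N, n))            # inputs
noise = sigma * rng.normal(size=N)     # additive Gaussian noise
y = g(X @ W.T) + noise                 # outputs depend on X only through X @ W.T

# Given the k projections, what remains of y is pure noise.
resid = y - g(X @ W.T)
```

A sufficient dimension reduction method sees only the pairs `(X, y)` and has to recover the row space of `W` together with the link function.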
There are three major methods for solving the problem: (1) sliced inverse regression [Li(1991)], [Cook and Weisberg(1991)]; (2) methods based on an analysis of the gradient and Hessian of the regression function [Li(1992)], [Xia et al.(2002)], [Mukherjee and Zhou(2006)]; (3) methods based on combining local classifiers [Hastie and Tibshirani(1996)], [Sugiyama(2007)].
Let us briefly outline the idea of our approach. We first recover the regression function that maps given inputs to corresponding outputs, i.e. , and estimate the distribution from our data by estimating the parameters
of the multivariate normal distribution.
Let us now treat the first step as done and regard as the ground truth. Since for the recovered it is natural to expect that , a natural way to reconstruct the vectors is to set them equal to the arguments at which the following minimum is attained:
where is a function that measures the distance between functions and given that their inputs are sampled according to . In the experimental part of the paper we consider the case .
Solving problems 1 and 2 in practice is both a theoretical and an experimental challenge. In both problems, the search spaces are infinite-dimensional and do not form a linear space. Moreover, in problem 1 the search space consists of generalized functions. Our paper is dedicated to developing a framework that tackles both problems.
The structure of the paper is the following. In section 2 we give standard definitions of tempered distributions and of the operations that can be applied to them, such as convolution and the Fourier transform. In section 3 we give mathematically precise definitions of the search space of problem 1, denoted , and the search space of problem 2, denoted , and prove that they are dual to each other in the sense that the image of under the Fourier transform is and vice versa. In section 4 we introduce our approach to optimization over (or, ) that is based on the use of so-called proper kernel functions, . Using proper kernels we prove theorem 4 that characterizes generalized as those for which the matrix of properly defined integrals is of rank . The main idea of the section is to define as a set of ordinary functions for which the squared Frobenius distance from to some rank matrix is not greater than . I.e. “approximates” in a certain sense. Theorem 4 is a key result of the section that demonstrates that solutions of problems for a sequence , under certain assumptions, can be transformed into a solution of problem . In section 5 we suggest an algorithm for solving which we call the alternating scheme (subsection 5.1). In subsection 5.2 we formulate the alternating scheme in the dual space for the case of the proper kernel . In section 6 we describe our computational experiments with synthetic data.
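The quantity "squared Frobenius distance from a matrix to the nearest matrix of rank at most k", which underlies the relaxed search space mentioned above, can be computed from singular values by the Eckart-Young theorem. The sketch below is generic linear algebra: the matrix `M` is a random stand-in and not the matrix of integrals defined in section 4, and the helper names are assumptions made for the example.

```python
import numpy as np

def sq_frobenius_dist_to_rank(M, k):
    """Squared Frobenius distance from M to the set of matrices of rank <= k.

    By the Eckart-Young theorem this equals the sum of the squared
    singular values of M beyond the k largest.
    """
    s = np.linalg.svd(M, compute_uv=False)   # singular values, descending
    return float(np.sum(s[k:] ** 2))

def best_rank_k(M, k):
    """A closest matrix of rank <= k, obtained by truncating the SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

rng = np.random.default_rng(3)
B = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 6))   # exactly rank 2
M = B + 1e-3 * rng.normal(size=(6, 6))                  # small perturbation of B

d2 = sq_frobenius_dist_to_rank(M, 2)
M2 = best_rank_k(M, 2)
```

A constraint of the form `sq_frobenius_dist_to_rank(M, k) <= eps` is one way to read "approximately of rank k", and it becomes an exact rank condition as `eps` tends to zero.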
Throughout the paper we use common terminology and notation from functional analysis. The Schwartz space of functions, denoted , is the space of infinitely differentiable functions such that , equipped with the standard topology, which is metrizable and complete. A tempered distribution is a continuous linear operator . For , denotes the image of under . The set of all such operators, denoted , is equipped with the weak topology. I.e. for a sequence and , (or ) means that for any . For , denotes the sequential closure of . The Fourier and inverse Fourier transforms are first defined as operators by , , and then extended to continuous bijective linear operators by the rule: . If a function is such that for any , then it induces a linear operator , where . For a measure , by we denote the Hilbert space of functions from to , square-integrable w.r.t. , with the inner product: . The induced norm is then . For the convolution is defined as . For , the convolution is defined as a tempered distribution such that where . If and a function is such that whenever , then the multiplication is defined by . The set of infinitely differentiable functions with compact support in is denoted as . If is a topological space, then a subset is said to be dense in if the sequential closure of is equal to . For a matrix the Frobenius norm is . For brevity, we denote
. The identity matrix of size is denoted as .
3 Basic function classes
To formalize distributions supported in a -dimensional subspace, we need a number of standard definitions. For and
their tensor product is the function such that . The span of , denoted , is called the tensor product of and . For and their tensor product is defined by the following rule: for any . Since is dense in , there is only one distribution that satisfies the latter identity.
An example of a generalized function, whose density is concentrated in a -dimensional subspace, is any distribution that can be represented as:
where . If where is an ordinary function, then can be understood as a generalized function whose density is concentrated in a subspace and equals . It can be shown that the distribution acts on in the following way:
Now to generalize the latter definition to any -dimensional subspace we have to introduce the change of variables in tempered distributions.
Let be an orthogonal matrix, i.e. . Then, is defined by the rule: where . If , then the latter definition gives where . Now let us define classes of tempered distributions:
The latter two classes are dual to each other. Before we prove that statement, let us note that formalizes all distributions with a -dimensional support and is defined by a set of ordinary functions . The condition on the matrix in the definition of can be relaxed (by only requiring that it is a full-rank matrix), i.e. . Indeed, if , then where and . It is easy to see that and . Thus, is just a set of functions that can be represented as a composition of a linear operator from to and a -ary smooth function. Or, in other words, is a function whose value on depends only on the projection of onto a -dimensional subspace, i.e. the row space of .
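The relaxation from a matrix with orthonormal rows to an arbitrary full-rank matrix can be checked numerically. The sketch below takes a function of the form "smooth function composed with a full-rank linear map", rewrites it with an orthonormal-row matrix obtained from a QR factorization, and verifies that the value depends only on the projection of the input onto the row space. The names `A`, `g`, `U_k`, `g_tilde` and the particular choice of `g` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 5, 2
A = rng.normal(size=(k, n))      # a generic k x n matrix, full rank almost surely

def g(z):
    """A hypothetical smooth function of k variables; z has shape (..., k)."""
    return np.exp(-z[..., 0] ** 2) * np.cos(z[..., 1])

def f(x):
    """f(x) = g(A x), written for row vectors x of shape (..., n)."""
    return g(x @ A.T)

# Reduced QR of A^T: A^T = Q R with Q (n, k) having orthonormal columns.
Q, R = np.linalg.qr(A.T)
U_k = Q.T                        # k x n matrix with orthonormal rows

def g_tilde(z):
    """Reparametrized link: g_tilde(U_k x) = g(A x), since x A^T = (x U_k^T) R."""
    return g(z @ R)

X = rng.normal(size=(100, n))
P = U_k.T @ U_k                  # orthogonal projector onto the row space of A

same_function = np.allclose(f(X), g_tilde(X @ U_k.T))
depends_on_projection_only = np.allclose(f(X), f(X @ P))
```

The invertible factor `R` is absorbed into the smooth function, which is why requiring orthonormal rows loses no generality.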
and . Let us prove first that if , then
where . For that we have to prove that for any . Indeed,
Let us calculate the image of under the Fourier transform. It is easy to see that for any and orthogonal we have: . Therefore, . Thus, if , then where where is a matrix consisting of the first rows of . Thus, . It is easy to see that by varying and in the expression we can obtain any function from . Therefore, , and from the bijectivity of the Fourier transform we obtain that .
Let us also define . It is easy to see that . For any collection , denotes , which is a linear space over . The set has the following simple characterization: For any , if and only if . [Proof ()] Let us prove that from it follows that .
It is easy to see that if . If , then for we have .
Thus, we have orthogonal vectors, , such that . Using standard linear algebra we get that there are at most distributions that form a basis of .
For a proof of the inverse statement we need the following lemma first. If is such that for any , then . [Proof of lemma] Recall from functional analysis that for the tempered distribution is defined by the condition . Once the Fourier transform is applied, our lemma’s dual version is equivalent to the following formulation: if , then . Let us prove the latter formulation.
Suppose and are chosen in such a way that , . Let us define:
It is easy to see that for any we have (at least one derivative over is present):
The terms and are bounded by definition of . The boundedness of is a consequence of the inequality (which holds because ): .
Analogously (no derivative over is present):
The second term is 0 when . It is also bounded when because and:
The latter is bounded, since .
The first term is 0 when and it is bounded for :
The latter is also bounded, since .
Thus, is bounded and . Therefore implies:
Since this sequence of arguments can be carried out for any , we can apply it sequentially to the initial w.r.t. and obtain that for any such that :
Moreover, since is dense in , we can assume that . For the inverse Fourier transform the latter condition becomes equivalent to:
for any such that . Let us define . It is easy to check that where for . I.e. and the lemma is proved. Proof of theorem 2 (). If , then
I.e., there exist at least orthonormal vectors , such that . Therefore, .
Let us complete to form an orthonormal basis of : . Let us define a matrix . It is easy to see that:
Since for we have , then . Using lemma 3 we obtain that . Therefore, . The theorem is proved.
4 Optimization over
The central problem that our paper addresses is how to optimize a target function over . Since is not a complete metric space (it is not even a sequential space [Smolyanov(1992)]), optimization over such spaces needs additional tools. In this section we suggest an approach based on penalty functions and kernels.
Throughout this section we assume that a function