Dimensionality reduction is an important problem in data science that has many facets and non-equivalent formulations coming from different contexts, either purely mathematical or arising in applications. The classical formulation goes back to the work of R. Fisher and is currently known as principal component analysis. Subsequently, the idea of principal components was applied to more general frameworks, giving birth to new branches of statistics and machine learning such as manifold learning (e.g. nonlinear dimensionality reduction) and sufficient dimension reduction. In the manifold learning formulation (which is the direct generalization of the classical one) we are usually given a finite number of points in $\mathbb{R}^n$ (sampled according to some unknown distribution) and our goal is to find a “low-dimensional” geometric structure that approximates “the support” of the distribution and satisfies some additional properties such as smoothness, low complexity, etc.
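Unlike the formulations above, in sufficient dimension reduction (sometimes called supervised dimension reduction) we are given a finite number of pairs $(x_i, y_i)$ with $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$, also generated according to some unknown joint distribution, and our goal is to find vectors $a_1, \dots, a_k \in \mathbb{R}^n$ (where $k < n$) such that, symbolically:
$$y \perp x \mid (a_1^T x, \dots, a_k^T x).$$
The latter means that the output $y$ is conditionally independent of $x$, given $(a_1^T x, \dots, a_k^T x)$. Equivalently, the conditional distribution of $y$ given $x$ is the same as the conditional distribution of $y$ given $(a_1^T x, \dots, a_k^T x)$.
Of course, the latter formulation can hardly be solved if we do not make any assumptions on the joint distribution, or more specifically on the conditional distribution of $y$ given $x$. A standard assumption is the following semi-parametric discriminative model:
$$y = g(a_1^T x, \dots, a_k^T x) + \varepsilon, \tag{1}$$
where $\varepsilon$ is Gaussian noise with $\mathbb{E}\,\varepsilon = 0$ and $\mathrm{Var}\,\varepsilon = \sigma^2$. The function $g: \mathbb{R}^k \to \mathbb{R}$ is an unknown smooth function. The function $f(x) = \mathbb{E}[y \mid x] = g(a_1^T x, \dots, a_k^T x)$ is then called the regression function.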
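To make the semi-parametric model concrete, the following sketch draws pairs from a model of this kind. The link function `g`, the matrix of directions `A`, the standard normal inputs and the noise level `sigma` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def sample_model(n_samples, A, g, sigma, rng):
    """Draw (x, y) pairs with y = g(A x) + Gaussian noise.

    A has shape (k, n); its rows play the role of the projection vectors.
    Inputs are standard normal here purely for illustration.
    """
    k, n = A.shape
    x = rng.standard_normal((n_samples, n))
    z = x @ A.T                                   # projections, shape (n_samples, k)
    eps = sigma * rng.standard_normal(n_samples)  # zero-mean Gaussian noise
    return x, g(z) + eps

rng = np.random.default_rng(0)
# Hypothetical directions: k = 2 projections of an n = 5 dimensional input
A = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0, 0.0]])
g = lambda z: np.sin(z[:, 0]) + z[:, 1] ** 2      # hypothetical smooth link
x, y = sample_model(1000, A, g, sigma=0.1, rng=rng)
```

In this toy setting the output depends on the five-dimensional input only through its first two coordinates, which is exactly the structure that sufficient dimension reduction tries to recover.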
There are three major approaches to estimating the parameters of model (1): (1) sliced inverse regression; (2) methods based on an analysis of the gradient and Hessian of the regression function; (3) methods based on combining local classifiers.
Probably the closest to ours is the second approach. Let us briefly outline its idea for . According to that approach, we first recover the regression function and estimate the distribution from our data, estimating the parameters of the multivariate normal distribution. At the second stage we no longer need our data and treat the recovered regression function and the estimated distribution as the ground truth. Since, for the recovered regression function, it is natural to expect that , a natural way to reconstruct the vectors is to set them equal to the first principal components of the matrix , where is the Hessian matrix of the regression function at the point .
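The Hessian-based idea can be sketched numerically as follows. This is only an illustration under stated assumptions: the Hessian is approximated by finite differences, the matrix whose principal components are taken is assumed to be the sample average of the squared Hessian, and the ridge function `f` and direction `a` are hypothetical test inputs.

```python
import numpy as np

def numerical_hessian(f, x, h=1e-3):
    """Central finite-difference Hessian of a scalar function f at point x."""
    n = x.size
    H = np.zeros((n, n))
    E = np.eye(n) * h
    for i in range(n):
        for j in range(i, n):
            H[i, j] = (f(x + E[i] + E[j]) - f(x + E[i] - E[j])
                       - f(x - E[i] + E[j]) + f(x - E[i] - E[j])) / (4 * h * h)
            H[j, i] = H[i, j]
    return H

def hessian_directions(f, points, k):
    """Top-k eigenvectors of the average squared Hessian over the sample."""
    n = points.shape[1]
    M = np.zeros((n, n))
    for x in points:
        H = numerical_hessian(f, x)
        M += H @ H.T
    M /= len(points)
    eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :k]         # columns are the leading directions

rng = np.random.default_rng(0)
a = np.array([3.0, 4.0, 0.0]) / 5.0        # hypothetical true direction (unit norm)
f = lambda x: np.sin(a @ x)                # function depending on x only through <a, x>
points = rng.standard_normal((200, 3))
v = hessian_directions(f, points, k=1)[:, 0]
```

For this test function every Hessian is a multiple of the outer product of `a` with itself, so the leading eigenvector `v` coincides with `a` up to sign and finite-difference error.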
In our paper we also assume that is an already given ground truth, though unlike the previous approach, we formulate the main problem as an optimization problem, i.e. our goal is to find
It is easy to see that the latter corresponds to the maximum likelihood approach to estimating the parameters . Since is an infinite-dimensional object, we analyse it with the tools of functional analysis, specifically using the theory of tempered distributions. The key observation of our analysis, stated in theorem 3.3 of section 3, is that the class of functions of the form can be characterized as those functions whose Fourier transform is supported in a $k$-dimensional linear subspace. Instead of optimizing over generalized functions with a $k$-dimensional support, we suggest minimizing over ordinary functions given in a generic form but with an additional constraint. In order to force their support to be $k$-dimensional, in section 4 we introduce a class of penalty functions such that large values of indicate a strong deviation of the support from any $k$-dimensional linear subspace. For the specific case of , in section 5 we develop an algorithm for our problem that can be formulated for functions given in the frequency coordinate form as well as in the initial coordinate form. The last section is dedicated to experiments on synthetic data.
Throughout the paper we will use common terminology and notation from functional analysis. The Schwartz space of functions, denoted $\mathcal{S}(\mathbb{R}^n)$, is the space of infinitely differentiable functions $f: \mathbb{R}^n \to \mathbb{C}$ such that $\sup_{x} |x^\alpha \partial^\beta f(x)| < \infty$ for all multi-indices $\alpha, \beta$, equipped with a standard topology, which is complete and metrizable. A Cartesian power is a set of vector-valued functions, i.e. if and only if .
By a tempered distribution we mean an element of the dual space $\mathcal{S}'(\mathbb{R}^n)$. The Fourier and inverse Fourier transforms are first defined as operators on the Schwartz space by:
and then extended to continuous bijective linear operators on the space of tempered distributions by the rule: . The Fourier transform can be applied component-wise to objects from the Cartesian power , which we will also call tempered distributions.
If a function is such that for any then it induces a tuple , where .
For a measure , by we denote the Hilbert space of functions from to , square-integrable w.r.t. , with the inner product: . The induced norm is then . The space (i.e. when is the Lebesgue measure) can be embedded into , i.e. , where corresponds to a tempered distribution . Therefore, the Fourier transform can be defined on , and we will use the fact that is a unitary operator.
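Unitarity means that the Fourier transform preserves inner products and norms. A discrete analogue of this fact can be checked with the orthonormally scaled DFT. The sketch below is only an illustration: the random test vectors are arbitrary, and conjugating the second argument of the inner product is a convention assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(256) + 1j * rng.standard_normal(256)
g = rng.standard_normal(256) + 1j * rng.standard_normal(256)

# norm="ortho" scales the DFT by 1/sqrt(N), which makes it a unitary map
F_f = np.fft.fft(f, norm="ortho")
F_g = np.fft.fft(g, norm="ortho")

# np.vdot conjugates its first argument: vdot(g, f) = sum_i f_i * conj(g_i)
inner_before = np.vdot(g, f)
inner_after = np.vdot(F_g, F_f)

norm_before = np.linalg.norm(f)
norm_after = np.linalg.norm(F_f)
```

Both the inner product and the norm agree before and after the transform up to floating-point error.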
For the convolution is defined as . For , the convolution is defined as a tempered distribution such that:
where and the multiplication is defined by:
Both operations can be extended to the case when by applying them to every component of .
A set of infinitely differentiable functions with a compact support in is denoted as . The Sobolev -norm on for is defined as . The Sobolev space is the completion of w.r.t. the norm .
For a matrix $A = [a_{ij}]$ the Frobenius norm is $\|A\|_F = \sqrt{\sum_{i,j} |a_{ij}|^2}$.
3 Problem formulation
Let $p$ be a probability density function on $\mathbb{R}^n$ such that . The probability density function defines the Hilbert space , i.e. where . We are also given a real-valued function from , which can be given in an arbitrary form, keeping in mind the case of a function defined by a feed-forward neural network. Our goal is to approximate this function in the following form (for $k$ fixed in advance):
where is an arbitrary function from and .
Theorem 3.1. For , we have and .
Proof of theorem 3.1.
It is enough to prove the theorem for . W.l.o.g. we can assume that are linearly independent. If they are linearly dependent and, e.g., , then we define . It is easy to see that , and thus we have reduced the problem to the case of the theorem for .
If is an invertible matrix, then . Indeed, if we denote and , then:
and after expanding all the brackets we obtain a finite sum of expressions of the kind , each of which is bounded. In fact, we have proved that the Schwartz class is invariant under invertible linear changes of variables.
Thus, if we complete with to form a basis in , and make the change of variables , then from we obtain a function . It remains to prove that this function is also in .
For any the expression is a sum of terms, each of which is bounded.
Finally, we note that and therefore .
If we choose the squared error as the loss function, then we come to the following optimization problem:
The problem is non-convex and the minimum is taken over an infinite-dimensional object. Let us reveal the structure of the objective:
We can apply the Fourier transform to our functions, taking into account that the Fourier transform is unitary on .
Let us denote . The following statement is an application of the convolution theorem to our case:
Theorem 3.2. If and , then and
Proof of theorem 3.2.
W.l.o.g. we again assume that . For we have:
I.e. . Unfortunately, is not a rapidly decreasing function, because , in general, defines a nonempty affine subspace and ’s value on the whole subspace will be constant . Therefore, the Fourier transform of is not necessarily an ordinary function.
is a continuous operator (i.e. a tempered distribution), therefore is also a tempered distribution.
By definition . Let us prove that
Since , there exists a sequence of functions , such that
The latter follows from the well-known fact that is dense in .
It is easy to see that
because we can set in the former expression.
The convolution theorem states that for any two functions we have:
Since is a continuous operator, and in . In order to obtain the needed result it remains to show that the convolution operator , is also continuous.
By definition where . I.e. we have to show that if
The latter is obvious if we set in the former expression. Thus, the theorem is proved.
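The proof above rests on the convolution theorem, which turns a pointwise product into a convolution of Fourier transforms. A finite-dimensional analogue can be verified exactly with the DFT, where the convolution is circular and a factor 1/N appears under NumPy's unnormalized convention. The vectors `f` and `q` below are arbitrary test data, not objects from the paper.

```python
import numpy as np

def circular_convolve(a, b):
    """Direct circular convolution: c[m] = sum_j a[j] * b[(m - j) mod N]."""
    N = len(a)
    idx = (np.arange(N)[:, None] - np.arange(N)[None, :]) % N  # idx[m, j] = (m - j) mod N
    return (a[None, :] * b[idx]).sum(axis=1)

rng = np.random.default_rng(0)
N = 64
f = rng.standard_normal(N)
q = rng.standard_normal(N)

# DFT of a pointwise product ...
lhs = np.fft.fft(f * q)
# ... equals the circular convolution of the DFTs divided by N
rhs = circular_convolve(np.fft.fft(f), np.fft.fft(q)) / N
```

The two sides agree up to rounding, which mirrors in the discrete setting the identity that the theorem extends to tempered distributions.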
The basic phenomenon behind our approach to optimization of (3) is the following statement:
Theorem 3.3. A function can be represented as if and only if there is an orthonormal basis such that:
where is Dirac’s delta function. Moreover, .
Sketch of the proof of theorem 3.3.
W.l.o.g. we can assume that and are linearly independent. A rigorous proof of the theorem would require a careful checking of certain integral identities. Instead we will present a sketch of the proof at the level of rigor common to theoretical physics papers.
($\Rightarrow$) We can also assume that are orthonormal. Indeed, after every redefinition of given by the rule we get the same function if we simultaneously transform to . By making such redefinitions, we can always orthogonalize by the Gram-Schmidt process with a subsequent scaling of ’s arguments.
Let us complete with to form an orthonormal basis in and set:
Then in the Fourier transform formula we will make the change of variables , , :
where . Here we used that . Thus, we obtain the needed representation.
($\Leftarrow$) Suppose that:
Using inverse Fourier transform we get:
After the change of variables , where
In essence, the theorem claims that if a function’s value depends only on the projection of its argument onto a $k$-dimensional linear subspace, then all frequencies in the spectrum of this function lie in that subspace.
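This phenomenon has an exact discrete counterpart that is easy to check. In the sketch below, a function on a periodic 32 by 32 grid depends on its two indices only through their sum, i.e. only through the direction (1, 1), and its two-dimensional DFT vanishes outside the line of frequencies spanned by that direction. The periodic grid and the random profile `h` are assumptions made for illustration.

```python
import numpy as np

N = 32
rng = np.random.default_rng(0)
h = rng.standard_normal(N)                  # arbitrary 1D profile (plays the role of g)

i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
f = h[(i + j) % N]                          # f depends on (i, j) only through i + j

F = np.fft.fft2(f)                          # F[u, v] over frequency pairs (u, v)
on_line_mask = (i == j)                     # frequencies with u == v, i.e. along (1, 1)

off_line_max = np.abs(F[~on_line_mask]).max()
on_line_values = F[on_line_mask]            # diagonal entries F[u, u], u = 0..N-1
```

All spectral mass sits on the diagonal `u == v`, where it equals `N * np.fft.fft(h)`, so the spectrum is supported on a one-dimensional set, in line with the statement of the theorem for a single direction.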
The set of tempered distributions of the form (4) is denoted as and called the set of functions with $k$-dimensional support.
Thus, our problem becomes equivalent to:
For simplicity of our notation, let us use and interchangeably (from the context it is always clear what we mean). Thus, our problem is:
Note that if we restricted to be an ordinary function, the latter problem would be a well-known one in the theory of inverse problems. E.g., in the case when , the problem of finding such that is known as the deconvolution of a Gaussian kernel, and has many applications in mathematical physics. But with our type of restriction, besides the fact that we cannot guarantee that the minimum is attained on a function from , the set itself is not well suited as an optimization space, as it lacks an obvious metric, completeness properties, etc.
Instead of minimizing over tempered distributions, we will relax the property that the support of the function is strictly $k$-dimensional, reducing the problem to optimization over ordinary functions:
where is a penalty term that penalizes if “the dimensionality of its support is greater than $k$”. In the next section we describe one natural approach to constructing such a penalty term .
4 Penalty function
Let be a continuous function such that and . Let us consider a set of functions:
We believe that the most interesting case in practice is . Since , we associate with the finite measure (induced by the density ):
on the $\sigma$-algebra of Lebesgue measurable sets. Any finite measure induces a probability measure via the normalization: . We will call a finite measure on a $k$-dimensional measure if there is a $k$-dimensional linear subspace such that .
In the previous section we proved that our problem (3) can be reduced to the optimization task (5) over functions with $k$-dimensional support. As we have already pointed out, (as well as ) lacks a standard metric, so we need to devise a way to measure the distance from an ordinary function to the set . If is an ordinary function, then its support cannot be strictly $k$-dimensional. It is natural to define the distance to as , for a proper distance function on measures. It turns out that $k$-dimensional measures can be characterized in a very simple way:
Let be a finite measure on such that . The measure is $k$-dimensional if and only if
Let be i.i.d. random vectors sampled according to and . A natural estimator for the matrix of second moments is:
This estimator is consistent, i.e.:
If we denote , then the latter can be shown by analysing . Indeed, are i.i.d. random variables with finite second moment . Therefore, by the weak law of large numbers:
I.e. , and therefore:
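The role of the second-moment matrix can be illustrated numerically. The sketch below assumes that the condition in the theorem is a bound on the rank of the second-moment matrix. It samples points from a measure concentrated on a hypothetical two-dimensional subspace of a six-dimensional space, forms the empirical second-moment matrix, and compares its spectrum with that of a noisy, full-dimensional sample. The dimensions, sample size and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, N = 6, 2, 5000

# Hypothetical k-dimensional subspace: column span of an orthonormal basis Q
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
coords = rng.standard_normal((N, k))
X = coords @ Q.T                            # samples supported on span(Q)

M_hat = X.T @ X / N                         # empirical second-moment matrix
eigvals = np.linalg.eigvalsh(M_hat)[::-1]   # eigenvalues in descending order
numerical_rank = int((eigvals > 1e-10 * eigvals[0]).sum())

# Full-dimensional noise destroys the low-rank structure
X_noisy = X + 0.1 * rng.standard_normal((N, n))
eig_noisy = np.linalg.eigvalsh(X_noisy.T @ X_noisy / N)[::-1]
```

For the subspace-supported sample only `k` eigenvalues are numerically nonzero, while for the noisy sample the smallest eigenvalue stays near the noise variance. This gap is what a penalty built on the second-moment matrix can exploit.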
($\Leftarrow$) Now suppose that . I.e. we can find orthonormal vectors such that . Since , then