In many situations, high-dimensional observations from a physical system can be modeled as weighted averages of a fixed number of unknown sources whose profiles cannot be assessed, separately. Although the contributing weights of sources for each observation are also unknown, one can still attempt to infer the source profiles through careful analysis of a sufficiently large number of sample observations. This computational problem arises in many areas, such as hyper-spectral remote sensing [li2015minimum, ambikapathi2011chance], statistical mixture analysis, and recognition of tumor heterogeneity in cancer sciences and bioinformatics [tolliver2010robust, satas2017tumor, zuckerman2013self]. In all such cases, unmixing the data refers to learning the unknown sources as well as the contributions of sources (weights) for each observed sample.
From a geometric point of view, this problem can be modeled as the learning of a high-dimensional simplex. A -dimensional simplex is defined as the set of all convex combinations of fixed points in , also called vertices. For example, simplices with and correspond to triangles and tetrahedra, respectively. Learning of a simplex refers to inferring its vertices through observing randomly chosen points from its interior. This problem can be viewed as solving the following set of equations:
where , and is a matrix. Here, s represent the observed points, each column of the unknown matrix represents a source, and s represent the unknown weights (discrete distributions) which generate s from the sources. We also assume that has a bounded condition number, and s are generated independently and distributed identically.
Some interesting questions in this setting would be: How many samples are required to achieve a reliable approximation of the unknown simplex (sample complexity)? Or how well any algorithm can perform if observations become noisy? In this study, we have analyzed the sample complexity of Maximum-Likelihood (ML) estimator for this task, and show that a sufficient sample size for insuring a total-variation error of at mostis only . To the best of our knowledge, this yields a significant improvement over the only existing result, i.e. [anderson2013efficient], where the sample complexity is upper-bounded as . On the other hand, ML algorithm is known to be NP-hard for this problem, thus we have proposed a sub-optimal and relaxed solution which enjoys from similar theoretical guarantees. In this regard, we have also formulated a Gradient Descent (GD)-based optimization scheme for our continuous relaxation that has a theoretically established convergence-rate, and is guaranteed to converge to at least a local minimum of our proposed loss. Experimental results demonstrate an acceptable performance for our method in the noiseless case, while show a considerable superiority compared to other existing methods when data samples are noisy.
1.1 Related Works
Previous works in this domain can be divided into two separate categories: papers that follow a theoretical approach, and those trying to solve a real-world application via heuristics. A seminal work on learning linear transformations, authors in[frieze1996learning] have proved efficient PAC-learnability of hyper-cubes via samples, suggesting a possibly similar result for simplices. More recently, [anderson2013efficient] proves that simplices are efficiently PAC-learnable111In this paper, the term PAC-learnable refers to any learning task whose sample complexity is shown to be polynomial w.r.t. the corresponding parameters, i.e. and . Whenever computational complexity shares this property as well, the term efficiently PAC-learnable is used. with a sample complexity of . The ML estimator in the noiseless setting is equivalent to finding the minimum volume simplex that contains all data points. This task is shown to be NP-hard [packer2002np], irrespective of the input representation which could be either via facets (polytopes) or the vertices. In this regard, [chan2009convex] introduces an LP-relaxation that computes the determinant of a modified vertex matrix instead of the volume. Determinant can then be written as the sum of matrix co-factors and consequently optimized in a row-wise manner. However, authors do not provide any theoretical guarantees while experiments are used for justification of their method. Authors in [nguyen2009learning]
studied a similar problem in lattice-based cryptography, where inputs are generated from a parallelepiped instead of a simplex. Moreover, our problem is also similar to Blind Source Separation (BSS) research, and in particular Independent Component Analysis (ICA). Researchers in the latter case, however, usually employ high-order moments for capturing the source signals[hyvarinen2004independent].
From a more applied perspective, learning of simplices is of practical interest in hyper-spectral imaging [zhang2017robust]. Hyper-spectral cameras capture electromagnetic energy scatters from a scene in multiple wavelengths, where the spectrum of each pixel is a combination of patterns related to the reflection (or radiation) of its basic elements [ambikapathi2011chance, agathos2014robust]. Recently, this field has attracted intensive research interests which are mostly centered on analysis of minimum volume simplex, c.f. [bioucas2012hyperspectral, li2015minimum, ambikapathi2011chance, lin2013endmember]. Another tightly related application w.r.t. learning of simplices is the recognition of common patterns in cancerous tumors [satas2017tumor]. Tumors usually have a high cell-type heterogeneity, and a collective knowledge from the characterstics of each cell-type is vital for an effective treatment. However, biological datasets are mostly in bulk format and therefore each sample is an aggregation of cells from all the cell-types. The idea of finding the smallest inclusive simplex for capturing the hidden profiles of cell-types is exploited in several recent researches in this area [tolliver2010robust, satas2017tumor, zuckerman2013self, schwartz2010applying].
The rest of the paper is organized as follows: Section 2 explains the mathematical notation and then formally defines our problem. Our proposed method together with the achieved theoretical results are presented in Section 3. Section 4 is devoted to experimental results on both synthetic and real-world datasets. Finally, conclusions are made in Section 5.
2 Notation and Definitions
Throughout the paper, we denote the number of mixtures in our model by , for . Without loss of generality, the number of features per sample can be assumed to be 222All data points within a -simplex lie in a -dimensional linear subspace, which is almost surely identifiable as long as .. Let us denote as the set of all discrete -dimensional probability mass functions . is generally referred to as the standard simplex. Let to be a -simplex with the set of vertices , where . Mathematically speaking,
Also, let represent the set of all -simplices in . For a -simplex , denotes the Lebesgue measure (volume) of . For , let to be the th polygonal facet of , i.e. the -simplex obtained by removing the th vertex of . In this regard, we define as the volume of the largest polygonal facet of . In a similar fashion, we denote by , the length of the largest line segment inside (also known as the diameter of ). Alternatively, can be defined as the maximum Euclidean distance between any two points inside .
In many scenarios, hardness in statistical inference of a simplex may depend on its level of geometric regularity. In other words, simplices with more or less equally-sized facets are considerably easier to learn comparing to those including very acute corners. Inspired by the Isoperimetric inequality in high-dimensional geometry [osserman1978isoperimetric], let us define the -isoperimetricity property for a -simplex as:
A -simplex is defined to be -isoperimetric if the following inequalities hold:
The essence of -isoperimetricity property for a -simplex ensures that it is comparably stretched along all the dimensions of . Also, let us define as the total-variation distance between two probability measures.
2.1 Problem Definition
Assume to be i.i.d. samples drawn uniformly from
, i.e. via the following probability distribution:
where is the indicator function of the -simplex which returns if and zero otherwise. The problem is to approximate from data samples in in an -PAC manner, i.e. total variation between the estimate and the actual simplex stays below , with probability at least .
3 Statistical Learning of Simplices
In this section, we first formulate the ML estimator for recovering the unknown -simplex from samples. ML estimation for this task is computationally intensive and cannot be solved via regular continuous optimization tools. In order to overcome this problem, in Section 3.1 a sub-optimal solution will be provided which is guaranteed to PAC learn . Sample complexity for both ML and the proposed sub-optimal solution with corresponding mathematical analyses are given in Section 3.2, while Section 3.3 provides details for solving the proposed optimization problem, numerically.
3.1 ML Estimation and a Sub-optimal Solution
ML estimation of works by finding the -simplex which maximizes the log-probability , which can be written as
Maximization of (1) is already proved to be NP-hard, and thus impractical in real-world situations. Moreover, the minimum required number of samples for this algorithm to work properly has remained unknown. A remedy for the numerical hardness of ML estimation, would be to replace its objective criterion with a more manageable, yet sub-optimal, cost function. Before that, it worth noting that maximization of (1) is equivalent to the following minimization problem:
Motivated by the formulation in (2), let us propose the following minimization problem in order to find an estimation for , which we denote by throughout the paper:
where both and the function can be chosen by the user.
must be a strictly increasing loss function. In Subsection3.3, an efficient technique for computation of derivatives and performing Gradient Descent (GD) algorithm are given which works for almost all reasonable choices of .
Following this replacement, we will then provide extensive mathematical analysis on the PAC-learnability of our proposed method as well as its sample complexity. Surprisingly, techniques used for this aim also shed light on the sample complexity of the ML estimation, which shows significant improvement over previous results.
3.2 PAC-Learnability and Sample Complexity
Following theorems provide a set of sufficient sample complexities for PAC-learnability of both the proposed and ML estimations. The mathematical analyses behind these results include notions from high-dimensional geometry and calculus, as well as strong tools from VC theory in statistical learning.
Theorem 1 (PAC-Learnability from Noiseless Data).
Assume a -simplex with Lebesgue measure , which is -isoperimetric for some . Also, assume to be i.i.d. samples uniformly drawn from . Assume there exists and , for which
where for .
Then, the solution to the optimization problem in (3), denoted by , with probability at least satisfies the inequality
where and represent uniform probability measures over simplices and , respectively.
Proof of Theorem 1 is given in the Appendix. This theorem gives an explicit upper-bound on the minimum number of samples which are sufficient for (3) to work. It should be noted that the preceding theorem work for all legitimate choices of and . Using this fact and by manipulating the proof of Theorem 1, it is possible to achieve a similar sample complexity bound for ML estimation.
Corollary 1 (PAC bounds for ML algorithm).
Assume a -simplex with a non-zero Lebesgue measure . Also, consider to be i.i.d. samples drawn uniformly from . Assume there exists and , for which we have
Then, the maximum likelihood estimator of , denoted by , follows the inequality with probability at least . Again, and represent uniform probability measures over simplices and , respectively.
Proof of Corollary 1 is also given in the Appendix. Surprisingly, geometric regularity of the simplex does not affect the sample complexity required by ML, although it may have a considerable impact on the computational complexity of the optimization algorithms. Analysis of the latter case is beyond the scope of this paper.
3.3 Numerical Optimization
In this section, we will provide a procedure to solve (3), numerically. Achieving the minimizer is simply approached by a Gradient Descent (GD) algorithm on the objective , as described in Algorithm 1. Let us parameterize any -simplex via its vertices, i.e. the matrix . In this regard, one needs to calculate in each iteration.
According to (3), is naturally broken into two separate terms. Computation of the gradient with respect to the last term, i.e. , is straightforward, since
and hence its derivatives can be computed easily.
For the first term in (3), i.e.
, one needs to compute the gradient vectorsfor valid . The remainder of this section is mostly dedicated to computing this expression. Let be the solution of the following quadratic program:
Then, through applying the chain rule we have:
where denotes the derivative of .
Let be desirable with respect to if any infinitesimal perturbation of does not change the support of . It is easy to show that for a fixed , occurrence of an undesirable simplex is almost surely impossible. Then, assuming is desirable, we can compute by investigating the following cases:
: There exists such that any infinitesimal perturbation of does not change the optimum value of the (5), which concludes that .
: It can be shown that is the projection of on the affine hull of . Also, desirability of implies that is in the relative-interior of . These conclude that, equals to the gradient of distance of from a known affine space, which is a quadratic form and can be computed efficiently.
4 Experimental Results
Optimization algorithm described in Section 3 has already been implemented in a software package which is available as a supplementary material. This section is devoted to investigating the performance of the proposed method on synthetic and real-world datasets, as well as comparison with rival frameworks. Throughout this section, the performance is measured as the average Euclidean distance between the vertices of the ground truth and those of the estimate . We call it error which can be written as
where minimization is taken over all permutations of the numbers . Also, all synthetic and real-world datasets have been normalized in all dimensions prior to simulation, thus any scaling effect in has been removed. Also, we have chosen . Parameters and the learning rate in Algorithm 1 are manually adjusted during simulations. Noticeably, we did not observe any considerable sensitivities to these parameters.
4.1 Synthetic Data
Figure 1 shows a number of snapshots of the proposed method while running on a set of nominal synthetic noiseless data points. As it can be seen, despite the fact that algorithm has started from an inappropriate initial condition, it converges to an acceptable solution in a moderate number of iterations.
Figure 1(a) illustrates the performance of our algorithm in the presence of noise. Accordingly,
noisy samples generated from a two-dimensional simplex. The additive noise for each sample is drawn from a zero-mean white Gaussian noise with standard deviation. Let us define the Signal to Noise Ratio (SNR) as , where
is the smallest singular value of matrix. Figure 1(a) represents the performance of our algorithm with respect to different values of SNRs (which can be done by changing the standard deviation of the noise). The result shows that our method performs well in moderate or high SNR scenarios. Figure 1(b) shows the estimation error as a function of dimensionality . Noiseless dataset size has been increased proportionally to , however due to the choice of (instead of the one suggested in proof of Corollary 1), error increases smoothly with respect to .
|Simple Dataset||Noisy Dataset||High Dimension Dataset|
In Table 1, we have compared our method to a number of well-known methods in this area. Datasets used for experiments are 1) Simple Dataset: a simple two-dimensional simplex with data points, 2) Noisy Dataset: a noisy version of the latter with , and 3) High Dimension Dataset: a noiseless simplex with and . Due to the results, our method has a comparable performance both in the simple and high-dimensional cases, while it outperforms all the rival strategies in noisy scenarios. In the next subsections, performance of our method is tested on two real-world applications.
4.2 Computational Biology
Unmixing cell-types from biological organs is becoming an interesting area in bioinformatics. Recently, authors in [shen2010cell] synthetically merged cells from three different tissues (namely brain, liver and kidney) several times and each time with known combination weights. They aimed to perform a subsequent deconvolution of samples via micro-array gene expression screens. Their dataset is a potentially appropriate target for our algorithm.
We have collected samples from the dataset, where the initial dimensionality of data is . Samples have been linearly projected into a corresponding -dimensional subspace, and then the proposed method has been applied. Figure 3 depicts the final results where the gene expression profiles of the initial three tissues have been identified with a significant accuracy. Moreover, the obtained combination weights obtained by our framework highly resemble the ground truth.
4.3 Hyper-spectral Remote Sensing
As mentioned in Section 1, a major application of unmixing falls in the area of hyper-spectral remote sensing where researchers aim to find the material composition profiles of remote areas via hyper-spectral imaging devices. In this regard, Cuprit dataset [fyzhu_2014_TIP_DgS_NMF] includes pixels, where each pixel has been measured in effectively spectral bands. Authors in [fyzhu_2014_TIP_DgS_NMF] suggested the presence of basic elements in the dataset, while a smaller subset of them might be dominant.
We have randomly chosen points from this dataset and reduced the dimensionality of data to via PCA. According to the numbers suggested by the eigen-analysis of the data, number of vertices for the simplex is chosen as . Figure 4 shows the performance of the proposed method on the data compared to the ground truth profiles presented in the literature. Evidently, algorithm finds the dominant elements with a considerable accuracy.
This paper presents a set of bounds on the sample complexity of learning high-dimensional simplices from uniformly-sampled observations. In particular, the required number of samples for PAC-learnability of the ML estimator has been shown to be , which yields a significant improvement over existing results. In addition, a continuously-relaxed (and thus computationally tractable) sub-optimal solution is proposed for this task which is backed by rigorous theoretical guarantees. On the other hand, numerous experiments magnify the applicability of our method on real-world datasets. Our method has shown a comparable performance to a number of well-known rival strategies, while shows a considerable superiority in highly noisy regimes. In future approaches, one may attempt to provide similar sample complexity bounds for the noisy case, which has not been tackled yet. Moreover, there is still room for improvement of sub-optimal solutions in both noisy and noiseless cases.
Appendix A Proof of Theorem 1
The proof consists of two major steps. First, we will show that the solution to the proposed optimization problem, regardless of the choice of parameters and/or sample complexity, has always a volume less than or equal to that of the actual simplex . Also, the intersection of solution and is always non-empty. Second, it will be shown that for sufficiently large , solutions with insignificant overlap with are highly improbable. The mathematical tools behind this proof are basic concepts from algebra and high-dimensional geometry, as well as notions from Vapnik-Chervonenkis (VC) theory in statistical learning.
Before proceeding to the main body of proof, it would be helpful to make a number of definitions and introduce some useful notation. The proposed optimization is rewritten here
For any -simplex , we simply show the objective function in (7) by
For , define
as the set of all -simplices with volume less or equal than , a relative overlap of at most , and at least one point of intersection with . Also, let us define the functions for and as
This way, it can be easily seen that the loss function in (7) is a shifted version of the empirical mean computed over all , i.e. . The following lemma shows that minimization of over can be reduced to minimizing it over instead.
Consider the loss function as defined in (8). Then, there exists such that minimizing the over the two hypothesis sets and are equivalent. In other words,
where according to definitions, represents the set of -simplices in with maximal volume , which has non-empty intersections with .
See Appendix C for the proof. ∎
The lemma will enforce many useful properties over the space of possible solutions which can be used in later stages of the proof. Before that, let us make a couple of more definitions. Interestingly, it can be easily verified that for all () and all , has a finite value. Let us define as
and accordingly the function set
Note that according to the definition of , for every function maps the input space into the interval for all choices of and .
Next, we have to show that for sufficiently large , the deviation between empirical values of and its expected value (statistical risk) is strictly bounded for all and with a high probability. An important thing to note is that the risk can be written as an empirical mean, and thus enjoys a number of useful concentration bounds as long as the hypothesis (function) set has certain desired properties. In this regard, a fundamental theorem from Vapnik-Chervonenkis (VC) theory of statistical learning gives us the following probability bound, for all and :
where represents the Rademacher complexity of the function set for data set size [mohri2012foundations]. Rademacher complexity is a complexity measure showing how different the functions in a function set can behave comparing to each other.
Based on the definition of and considering the fact it intersects with in at least one point, it can be verified that
where represent the diameter of -simplex . Dealing with the term requires more sophisticated mathematical tools and analyses. In the following lemme we obtain an upper bound on the Rademacher complexity of the function set . This bound shows that both terms in the r.h.s. of the inequality in (11) decrease with the rate (ignoring the term ).
Assume the function set for a strictly increasing function and as
where and is defined as in (9). Then, the Rademacher complexity of can be bounded through the following inequality:
See Appendix D for the proof. ∎
Consider a -simplex with , which is assumed to be -isoperimetric for some . Assume to be a set of i.i.d. samples drawn uniformly from , and we have
Then, the following inequality holds for all :
where is defined as in (8).
See Appendix E for the proof. ∎
Through simple substitutions and subsequent simplifications, it can be said that for all and , the following lower-bound holds with probability at least :