Quantile-Quantile Embedding for Distribution Transformation, Manifold Embedding, and Image Embedding with Choice of Embedding Distribution

06/19/2020 ∙ by Benyamin Ghojogh, et al. ∙ University of Waterloo

We propose a new embedding method, named Quantile-Quantile Embedding (QQE), for distribution transformation, manifold embedding, and image embedding with the ability to choose the embedding distribution. QQE, which uses the concept of the quantile-quantile plot from visual statistical tests, can transform the distribution of data to any desired theoretical distribution or empirical reference sample. Moreover, QQE gives the user a choice of embedding distribution when embedding the manifold of data into a low dimensional embedding space. It can also be used to modify the embedding distribution of different dimensionality reduction methods, either basic or deep ones, for better representation or visualization of data. We propose QQE in both unsupervised and supervised manners. QQE can also transform the distribution to either the exact reference distribution or the shape of the reference distribution; one of its many applications is better discrimination of classes. Our experiments on different synthetic and image datasets show the effectiveness of the proposed embedding method.


I Introduction

Consider some data with a probability distribution, which may be standard or arbitrary. We can transform the data so that their distribution becomes a desired distribution. However, this transformation should not significantly modify the relative local distances of nearby data points [saul2003think], so that the information in the data is not destroyed. For this distribution transformation, one can try to make all the moments of the data equal to the moments of the desired distribution [gretton2007kernel, gretton2012kernel]. However, because of the huge number of moments, this can be computationally expensive. Also, matching all the moments results in transformation to the exact desired distribution and not merely to the "shape" of the desired distribution; one may want to transform only the shape of the distribution and not to the exact distribution. Furthermore, the moments of non-standard distributions can be hard to compute in some cases. Hence, a method for distribution transformation is required which can be used for any desired distribution, standard or non-standard. This method should also support a desired distribution given either as a theoretical Probability Density Function (PDF)/Cumulative Distribution Function (CDF) or as an empirical reference sample.

In the field of manifold learning and dimensionality reduction, the choice of embedding distribution is usually not given to the user. Some methods make an assumption on the distribution of neighbors of data points. For example, Stochastic Neighbor Embedding (SNE) and t-SNE assume Gaussian [hinton2003stochastic] and Cauchy [maaten2008visualizing] (or Student-t [van2009learning]) distributions, respectively, for the neighborhoods of points. These methods make strong assumptions on the neighborhoods of points and do not give the user freedom of choice for the embedding distribution. Other manifold learning methods do not make any assumption on the embedding distribution, yet still do not give any choice of embedding distribution to the user. Some examples are Principal Component Analysis (PCA) [ghojogh2019unsupervised], Multi-dimensional Scaling (MDS) [cox2008multidimensional], Sammon mapping [sammon1969nonlinear], Fisher Discriminant Analysis (FDA) [ghojogh2019fisher], Isomap [tenenbaum2000global], LLE [roweis2000nonlinear, saul2003think], and deep manifold learning [he2016deep, schroff2015facenet]. Note that some of these methods do make assumptions, but not in the form of a distribution for the embedding. For example, FDA assumes a Gaussian distribution for the data in the input space, and LLE merely assumes zero mean and unit covariance for the embedded data. There is a need for a manifold learning method which gives the user the freedom to choose the embedding distribution, either for the whole data or for each class of data separately.

In this paper, we propose a new embedding method, named Quantile-Quantile Embedding (QQE), which can be used for distribution transformation, manifold learning, and image embedding with choice of embedding distribution. The features and advantages of QQE are summarized in the following:

  1. Distribution transformation to a desired distribution given either as a PDF/CDF or as an empirical reference sample by the user. Also, either the whole data or every class of data can be transformed, in unsupervised and supervised manners, respectively.

  2. Manifold embedding of high dimensional data into a lower dimensional embedding space with the choice of embedding distribution by the user. Again, the embedding distribution of either the whole data or every class can be determined in unsupervised and supervised manners, respectively. Manifold embedding in QQE can also modify the embedding of other manifold learning methods, either basic or deep ones, for better discrimination of classes or better representation/visualization of data.

  3. For both distribution transformation and manifold embedding tasks, the distribution can be transformed to either the exact desired distribution or merely the shape of it. One of the many applications of exact distribution transformation is separation of classes in data.

This paper is organized as follows. Section II introduces the background on quantile functions, the univariate quantile-quantile plot, and its multivariate version. In Section III, we propose the QQE method for both distribution transformation and manifold embedding. The experimental results are reported in Section IV. Finally, Section V concludes the paper and enumerates the future directions.

II Quantile and Quantile-Quantile Plots

II-A Quantile Function and Quantile Plot

The quantile function for a distribution is defined as [parzen1979nonparametric, hyndman1996sample]:

Q(p) := F^{-1}(p) = \inf\{x \in \mathbb{R} \mid F(x) \geq p\}, \quad p \in (0,1), (1)

where p is called the position and F is the CDF. The quantile function can also be defined as:

Q(p) = \inf\{x \mid \mathbb{P}(X \leq x) \geq p\}, (2)

where X is a random variable with CDF F [ferguson1967mathematical, serfling2004nonparametric]. The two-dimensional plot (p, Q(p)) is called the quantile plot, which was first proposed by Sir Francis Galton [galton1885application]. It was initially named the ogival curve because, owing to the normal distribution of his measured experimental sample, it resembled an ogive.

If we have a drawn sample, with sample size n, from a distribution, the quantile plot is a sample (or empirical) quantile plot. The sample quantile plot is (p_i, Q(p_i)) for i \in \{1, \dots, n\}. For the sample quantile, we can determine the i-th position, denoted by p_i, as:

p_i := \frac{i - a}{n + b}, (3)

where different values for a and b result in different positions [leon1984another]. The simplest type of position is p_i = i/n (with a = b = 0) [parzen1979nonparametric]. The most well-known position is p_i = (i - 0.5)/n (with a = 0.5, b = 0) [allen1914storage]. However, it is suggested in [hyndman1996sample] to use p_i = (i - 1/3)/(n + 1/3) (with a = b = 1/3), which is median unbiased [reiss2012approximate]. It is noteworthy that Galton also suggested that we can measure the quantile function only at the quartile positions p \in \{0.25, 0.5, 0.75\} as a summary [galton1874proposed]. His summary is promising only for the normal distribution; however, with the power of today's computers, we can compute the sample quantile with fine steps.
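As a concrete illustration, the following NumPy sketch computes the sample quantile plot under the position parametrization of Eq. (3); the function name and the default a = b = 1/3 are our choices for illustration, not prescribed by the text.

```python
import numpy as np

def sample_quantile_plot(x, a=1/3, b=1/3):
    """Return positions p_i and sorted sample values Q(p_i), Eq. (3).

    a = b = 1/3 gives the median-unbiased position of [hyndman1996sample];
    a = 0.5, b = 0 gives the well-known (i - 0.5)/n position.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    p = (i - a) / (n + b)   # positions, Eq. (3)
    return p, x             # the pairs (p_i, Q(p_i))

# Example: sample quantile plot of a Gaussian sample.
rng = np.random.default_rng(0)
p, q = sample_quantile_plot(rng.normal(size=100))
```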

For the multivariate quantile plot, the spatial rank fulfills the role played by the position in the univariate case. The spatial rank of x \in \mathbb{R}^d with respect to the sample \{x_i\}_{i=1}^n is defined as [mottonen1995multivariate, marden2004positions, serfling2004nonparametric, dhar2014comparison]:

R(x) := \frac{1}{n} \sum_{i=1}^{n} \frac{x - x_i}{\|x - x_i\|_2}, (4)

whose term in the summation is a generalization of the sign function to a multivariate vector [marden2004positions]. Eq. (2) can be restated as Q(u) = \arg\min_q \mathbb{E}[|X - q| + u (X - q)], where u := 2p - 1 [chaudhuri1996geometric]. The multivariate spatial quantile (or geometric quantile) for the multivariate spatial rank is defined as:

Q(u) := \arg\min_{q \in \mathbb{R}^d} \mathbb{E}\big[\|X - q\|_2 + u^\top (X - q)\big], (5)

where X \in \mathbb{R}^d is a random vector, q \in \mathbb{R}^d, and u is a vector in the open unit ball, i.e., \|u\|_2 < 1 [chaudhuri1996geometric, serfling2004nonparametric, dhar2014comparison].
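A minimal sketch of the spatial rank in Eq. (4); the helper name is ours, and exact ties with sample points are skipped to avoid division by zero (an implementation detail not discussed in the text).

```python
import numpy as np

def spatial_rank(x, sample):
    """Spatial rank R(x) of a point x w.r.t. a sample, Eq. (4)."""
    diffs = x - sample                     # rows: x - x_i
    norms = np.linalg.norm(diffs, axis=1)  # ||x - x_i||_2
    keep = norms > 0                       # skip exact ties
    return (diffs[keep] / norms[keep, None]).mean(axis=0)

# Ranks lie in the unit ball: near the origin for central points,
# near norm one for outlying points.
rng = np.random.default_rng(0)
S = rng.normal(size=(500, 2))
print(np.linalg.norm(spatial_rank(np.zeros(2), S)))           # small
print(np.linalg.norm(spatial_rank(np.array([5.0, 5.0]), S)))  # close to 1
```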

II-B Quantile-Quantile Plot

Assume we have two quantile functions, Q_1 and Q_2, for two univariate distributions. If we match their positions and plot (Q_1(p), Q_2(p)), we will have the quantile-quantile plot, or qq-plot in short [loy2016variations]. Again, this plot can be an empirical plot, i.e., (Q_1(p_i), Q_2(p_i)) for i \in \{1, \dots, n\}. Note that the qq-plot is equivalent to the quantile plot when the first distribution is uniform, as we have Q(p) = p for this distribution. Usually, as a statistical test, we want to see whether the first distribution is similar to the second empirical or theoretical distribution [loy2016variations]; therefore, we refer to the first and second distributions as the observed and reference distributions, respectively [easton1990multivariate]. Note that if the qq-plot of two distributions is a line with slope one (angle 45°) and intercept zero, the two distributions are identical [oldford2016self]. The slope and the intercept of the line reflect the difference in spread and location of the two distributions, respectively [loy2016variations].
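For two equal-size univariate samples, the empirical qq-plot amounts to pairing sorted values; a minimal sketch (assuming equal sample sizes and no interpolation between order statistics):

```python
import numpy as np

def qq_pairs(observed, reference):
    """Empirical qq-plot pairs for two equal-size univariate samples."""
    f = np.sort(observed)    # observed quantiles
    g = np.sort(reference)   # reference quantiles
    return f, g              # plot f against g to get the qq-plot

rng = np.random.default_rng(0)
f, g = qq_pairs(rng.normal(2, 3, size=1000), rng.normal(0, 1, size=1000))
# Same shape, different location/scale: the pairs lie near a line with
# slope ~3 and intercept ~2 rather than the 45-degree identity line.
print(np.polyfit(g, f, 1))
```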

In order to extend the qq-plot to multivariate distributions, we can consider the marginal quantiles. However, this fails to take the dependence between the marginals into account [dhar2014comparison, easton1990multivariate]. There exist different methods for a promising generalization. One of these methods is the fuzzy qq-plot [easton1990multivariate] (note that it is not related to fuzzy logic). In a fuzzy qq-plot, a sample of the same size is drawn from the reference distribution and the data points of the two samples are matched using optimization. An affine transformation is also applied to the observed sample in order to make the comparison invariant to affine transformations. In the multivariate qq-plot, the matched data points are used to plot the qq-plots for every component; therefore, we will have d qq-plots, where d is the dimensionality of the data. Note that these plots are different from the qq-plots of the marginal distributions. The technical details of the fuzzy qq-plot are explained in the following.

II-C Multivariate Fuzzy Quantile-Quantile Plot

Assume we have a dataset \{x_i\}_{i=1}^n with sample size n and dimensionality d, i.e., x_i \in \mathbb{R}^d. We want to transform its distribution to a desired distribution. We draw a sample \{y_i\}_{i=1}^m of size m = n from the desired (reference) distribution. Note that in case we already have a reference sample of size m rather than the reference distribution, we can employ bootstrapping or oversampling if m > n or m < n, respectively, to have m = n. We match the data points x_i and y_i [easton1990multivariate]:

\underset{\pi \in \Pi,\, A,\, b}{\text{minimize}} \quad \sum_{i=1}^{n} \big\|(A x_i + b) - y_{\pi(i)}\big\|_2^2, (6)

where A \in \mathbb{R}^{d \times d} and b \in \mathbb{R}^d are used to make the matching problem invariant to affine transformation. If \Pi is the set of all possible permutations of the integers \{1, \dots, n\}, we have \pi \in \Pi. This optimization problem finds the best permutation regardless of any affine transformation.

In order to solve this problem, we iteratively switch between solving for \pi, A, and b until there is no change in \pi [easton1990multivariate]. Given A and b, we solve:

\underset{M}{\text{minimize}} \quad \sum_{i=1}^{n} \sum_{j=1}^{n} C(i,j)\, M(i,j), (7)

which is an assignment problem and can be solved using the Hungarian method [kuhn1955hungarian]. The C \in \mathbb{R}^{n \times n} and M \in \{0,1\}^{n \times n} are the cost matrix and a permutation matrix with exactly one 1 in every row and column, respectively. Note that M(i,j) = 1 means that x_i and y_j are matched. The cost matrix C should be computed before solving the optimization, where C(i,j) := \|(A x_i + b) - y_j\|_2^2.

According to the 1's in the obtained M, we have the permutation \pi and thus the matched pairs (x_i, y_{\pi(i)}). Then, given \pi, we solve:

\underset{A,\, b}{\text{minimize}} \quad \sum_{i=1}^{n} \big\|(A x_i + b) - y_{\pi(i)}\big\|_2^2, (8)

which is a multivariate regression problem. The solution is [hastie2009elements]:

\hat{\beta} = (\breve{X}^\top \breve{X})^{-1} \breve{X}^\top Y, (9)

where \breve{X} \in \mathbb{R}^{n \times (d+1)} is the data matrix [x_1, \dots, x_n]^\top with an appended column of ones and Y := [y_{\pi(1)}, \dots, y_{\pi(n)}]^\top \in \mathbb{R}^{n \times d}. We will have \breve{X} \hat{\beta} \approx Y. Therefore, A and b are found, where A^\top is the top (d \times d) sub-matrix of \hat{\beta} \in \mathbb{R}^{(d+1) \times d} and b^\top is the last row of \hat{\beta}.

Note that it is better to set the initial matrix to the identity, i.e., A^{(0)} := I, so as not to have much rotation in the assignment. In this way, only a few iterations suffice to solve the matching problem. This iterative optimization gives us the matching, and the samples \{x_i\}_{i=1}^n and \{y_{\pi(i)}\}_{i=1}^n are matched. Then, we have d qq-plots, one for every dimension. These qq-plots are named fuzzy qq-plots [easton1990multivariate]. Considering the spatial ranks, the matched points serve as the observed and reference quantiles [dhar2014comparison]:

Q_x(R(x_i)) = x_i, (10)
Q_y(R(y_{\pi(i)})) = y_{\pi(i)}. (11)
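The alternating optimization of Eqs. (6)-(9) can be sketched with SciPy's Hungarian solver and ordinary least squares. This is a simplified illustration rather than the authors' implementation; the identity initialization of A follows the note above, and the iteration cap is arbitrary.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def fuzzy_match(X, Y, n_iter=10):
    """Match rows of X (observed) to rows of Y (reference), Eqs. (6)-(9).

    Returns pi such that x_i is matched with y_{pi[i]}.
    """
    n, d = X.shape
    A, b = np.eye(d), np.zeros(d)          # identity initialization of A
    Xb = np.hstack([X, np.ones((n, 1))])   # X with an appended ones column
    pi = None
    for _ in range(n_iter):
        Z = X @ A.T + b                    # affinely transformed sample
        C = ((Z[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)  # cost matrix
        _, new_pi = linear_sum_assignment(C)   # Hungarian method, Eq. (7)
        if pi is not None and np.array_equal(new_pi, pi):
            break                          # no change in the permutation
        pi = new_pi
        beta = np.linalg.lstsq(Xb, Y[pi], rcond=None)[0]  # Eqs. (8)-(9)
        A, b = beta[:d].T, beta[d]         # top sub-matrix and last row
    return pi
```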

III Quantile-Quantile Embedding

Now, we provide our definition for distribution transformation:

Definition 1 (distribution transformation).

For a sample \{x_i\}_{i=1}^n of size n in \mathbb{R}^d, the mapping \{x_i\}_{i=1}^n \mapsto \{\hat{x}_i\}_{i=1}^n is a distribution transformation if the distribution of \{\hat{x}_i\}_{i=1}^n is the known desired distribution and the local distances of nearby points in \{x_i\}_{i=1}^n are preserved in \{\hat{x}_i\}_{i=1}^n as much as possible.

Distribution transformation can be performed in two approaches: (i) the distribution of data is transformed to the "exact" reference distribution, or (ii) only the "shape" of the reference distribution is transformed to. In the following Subsections III-A and III-B, we detail these two approaches, respectively. Then, we introduce manifold embedding using QQE in Subsection III-C. Finally, Subsection III-D explains the unsupervised and supervised approaches for QQE.

III-A Distribution Transformation to Exact Reference Distribution

When the qq-plots are obtained by the fuzzy qq-plot, we can use them to embed the data for distribution transformation. Consider the transformation of an initial sample \{x_i\}_{i=1}^n to \{\hat{x}_i\}_{i=1}^n. We want the distribution of the sample to become the same as the distribution of the reference sample or the reference distribution. Therefore, the qq-plot of every dimension should be a line with slope one and intercept zero [oldford2016self]. Let \hat{x}_{i,j} denote the j-th dimension of \hat{x}_i, which is used for the i-th data point in the j-th qq-plot, and let y_{\pi(i),j} denote the j-th dimension of the matched reference point y_{\pi(i)}. In order to have this line in every qq-plot, according to Eqs. (10) and (11), we should minimize \sum_{j=1}^{d} \sum_{i=1}^{n} (\hat{x}_{i,j} - y_{\pi(i),j})^2. In vector form, the cost function is restated as:

c_1 := \sum_{i=1}^{n} \|\hat{x}_i - y_{\pi(i)}\|_2^2. (12)

On the other hand, according to our definition of distribution transformation, we should also preserve the local distances of the nearby data points as far as possible to embed the data locally [saul2003think]. For preserving the local distances, we minimize the differences of the local distances between the data and the transformed data, using the k-nearest-neighbors (k-NN) graph of the set \{x_i\}_{i=1}^n. Let N_i denote the set containing the indices of the neighbors of x_i. The cost to be minimized is:

c_2 := \frac{1}{\gamma} \sum_{i=1}^{n} \sum_{j \in N_i} \frac{(d_{ij} - \hat{d}_{ij})^2}{d_{ij}}, (13)

where d_{ij} := \|x_i - x_j\|_2, \hat{d}_{ij} := \|\hat{x}_i - \hat{x}_j\|_2, and \gamma := \sum_{i=1}^{n} \sum_{j \in N_i} d_{ij} is the normalization factor. The weight 1/d_{ij} gives more value to closer points, as expected. Note that if N_i includes all the data points, Eq. (13) is the cost function used in Sammon mapping [sammon1969nonlinear, lee2007nonlinear]. We use this cost as a regularization term in our optimization. Therefore, our optimization is:

\underset{\{\hat{x}_i\}_{i=1}^n}{\text{minimize}} \quad c := c_1 + \lambda\, c_2, (14)

where \lambda > 0 is the regularization parameter.
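A sketch of evaluating the objective of Eq. (14); the function name and the default values of k and lam are illustrative only (the paper's own settings are not reproduced here), and the k-NN graph is built on the original data as in Eq. (13).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def qqe_cost(X, X_hat, Y_matched, k=10, lam=0.1):
    """Objective c = c1 + lam * c2 of Eq. (14); k and lam are illustrative."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nbrs.kneighbors(X)          # k-NN graph of the original data
    dist, idx = dist[:, 1:], idx[:, 1:]     # drop each point's self-neighbor
    c1 = ((X_hat - Y_matched) ** 2).sum()   # Eq. (12)
    d_hat = np.linalg.norm(X_hat[:, None, :] - X_hat[idx], axis=2)
    gamma = dist.sum()                      # normalization factor of Eq. (13)
    c2 = ((dist - d_hat) ** 2 / dist).sum() / gamma   # Eq. (13)
    return c1 + lam * c2
```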

Proposition 1.

The gradient of the cost function c with respect to \hat{x}_i is:

\frac{\partial c}{\partial \hat{x}_i} = 2 (\hat{x}_i - y_{\pi(i)}) - \frac{2\lambda}{\gamma} \sum_{j \in N_i} \frac{d_{ij} - \hat{d}_{ij}}{d_{ij}\, \hat{d}_{ij}} (\hat{x}_i - \hat{x}_j). (15)
Proof.

Proof in Appendix A. ∎

Proposition 2.

The second derivative of the cost function c with respect to the k-th component \hat{x}_{i,k} of \hat{x}_i is:

\frac{\partial^2 c}{\partial \hat{x}_{i,k}^2} = 2 - \frac{2\lambda}{\gamma} \sum_{j \in N_i} \frac{1}{d_{ij}\, \hat{d}_{ij}} \Big[ (d_{ij} - \hat{d}_{ij}) - \frac{(\hat{x}_{i,k} - \hat{x}_{j,k})^2}{\hat{d}_{ij}} \Big( 1 + \frac{d_{ij} - \hat{d}_{ij}}{\hat{d}_{ij}} \Big) \Big]. (16)
Proof.

Proof in Appendix B. ∎

We use the quasi-Newton method [nocedal2006numerical] for solving this optimization problem, inspired by [sammon1969nonlinear]. If we consider the vectors component-wise, the diagonal quasi-Newton method updates the solution as [lee2007nonlinear]:

\hat{x}_{i,k}^{(t+1)} := \hat{x}_{i,k}^{(t)} - \eta\, \frac{\partial c / \partial \hat{x}_{i,k}}{\big| \partial^2 c / \partial \hat{x}_{i,k}^2 \big|}, (17)

where t is the index of the iteration, \eta is the learning rate, and |\cdot| denotes the absolute value, which guarantees that we move toward a minimum and not a maximum in Newton's method.
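One sweep of the update in Eq. (17), assembling the gradient of Eq. (15) and the diagonal second derivative of Eq. (16); a per-point sketch with a small eps guarding the divisions (our implementation detail, not from the text), and illustrative default values.

```python
import numpy as np

def qqe_update(X_hat, Y_matched, dist, idx, gamma, lam=0.1, eta=0.1, eps=1e-9):
    """One sweep of Eq. (17) over all points (lam and eta are illustrative).

    dist/idx: k-NN distances and indices in the original space, Eq. (13).
    """
    X_new = X_hat.copy()
    for i in range(len(X_hat)):
        diff = X_hat[i] - X_hat[idx[i]]             # x_i - x_j for j in N_i
        d_hat = np.linalg.norm(diff, axis=1) + eps  # current distances
        d = dist[i]                                 # original distances
        w = (d - d_hat) / (d * d_hat)
        grad = 2 * (X_hat[i] - Y_matched[i]) \
            - (2 * lam / gamma) * (w[:, None] * diff).sum(axis=0)   # Eq. (15)
        hess = 2 - (2 * lam / gamma) * (
            (1 / (d * d_hat))[:, None]
            * ((d - d_hat)[:, None]
               - diff ** 2 / d_hat[:, None]
               * (1 + (d - d_hat) / d_hat)[:, None])
        ).sum(axis=0)                               # Eq. (16), per component
        X_new[i] = X_hat[i] - eta * grad / (np.abs(hess) + eps)     # Eq. (17)
    return X_new
```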

III-B Distribution Transformation to the Shape of Reference Distribution

We can ignore the location and scale of the reference distribution and merely change the distribution of the observed sample to look like the "shape" of the reference distribution, regardless of its location and scale. Recall that if the qq-plot is a line, the shapes of the distributions are the same, where the intercept and slope of the line correspond to the location and scale, respectively [oldford2016self]. Therefore, in our optimization, rather than trying to make the qq-plot a line with slope one and intercept zero, we try to make it the closest possible line. This line can be found by fitting a line as a least squares problem, i.e., a linear regression problem. For every dimension, we fit a line to its qq-plot. If we define f_j := [\hat{x}_{1,j}, \dots, \hat{x}_{n,j}]^\top as the observed quantiles of the j-th dimension, let \breve{F}_j := [f_j, \mathbf{1}] \in \mathbb{R}^{n \times 2}. Fitting a line to the qq-plot of the j-th dimension is the following least squares problem:

\underset{\beta_j}{\text{minimize}} \quad \|g_j - \breve{F}_j \beta_j\|_2^2, (18)

whose solution is [hastie2009elements]:

\hat{\beta}_j = (\breve{F}_j^\top \breve{F}_j)^{-1} \breve{F}_j^\top g_j, (19)

where g_j := [y_{\pi(1),j}, \dots, y_{\pi(n),j}]^\top denotes the reference quantiles of the j-th dimension. The points on the line fitted to the qq-plot of the j-th dimension are:

\breve{g}_j := \breve{F}_j \hat{\beta}_j, (20)

which are used instead of g_j in our optimization. Defining \breve{g}_i := [\breve{g}_{i,1}, \dots, \breve{g}_{i,d}]^\top as the i-th point across the d fitted lines, the optimization problem is:

\underset{\{\hat{x}_i\}_{i=1}^n}{\text{minimize}} \quad c' := \sum_{i=1}^{n} \|\hat{x}_i - \breve{g}_i\|_2^2 + \lambda\, c_2. (21)

Similar to Proposition 1, the gradient is:

\frac{\partial c'}{\partial \hat{x}_i} = 2 (\hat{x}_i - \breve{g}_i) - \frac{2\lambda}{\gamma} \sum_{j \in N_i} \frac{d_{ij} - \hat{d}_{ij}}{d_{ij}\, \hat{d}_{ij}} (\hat{x}_i - \hat{x}_j), (22)

and the second derivative is the same as in Proposition 2. We again solve using the diagonal quasi-Newton method [nocedal2006numerical].
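The per-dimension line fit of Eqs. (18)-(20) produces the targets \breve{g} used in Eq. (21); a minimal NumPy sketch with hypothetical names:

```python
import numpy as np

def shape_targets(X_hat, Y_matched):
    """Per-dimension line fit of Eqs. (18)-(20).

    Returns the points on the fitted lines; their rows replace
    y_{pi(i)} as the targets in Eq. (21).
    """
    n, d = X_hat.shape
    G = np.empty((n, d))
    for j in range(d):
        F = np.column_stack([X_hat[:, j], np.ones(n)])   # [f_j, 1]
        beta, *_ = np.linalg.lstsq(F, Y_matched[:, j], rcond=None)  # Eq. (19)
        G[:, j] = F @ beta                               # Eq. (20)
    return G
```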

III-C Manifold Embedding

QQE can be used for manifold embedding in a lower dimensional embedding space, where the embedding distribution can be determined by the user. As an initialization, the high dimensional data are embedded in a lower dimensional embedding space using a dimensionality reduction method. Thereafter, the low dimensional embedded data are transformed to a desired distribution using QQE.

Any dimensionality reduction method can be utilized for the initialization of data in the low dimensional subspace. Some examples are PCA [ghojogh2019unsupervised] (or metric MDS [cox2008multidimensional]), FDA [ghojogh2019fisher], Isomap [tenenbaum2000global], LLE [roweis2000nonlinear], t-SNE [van2009learning], and deep features like triplet Siamese features [schroff2015facenet] and ResNet features [he2016deep].

After the initialization, a reference sample is drawn from the reference distribution or is taken from the user. The dimensionality of the reference sample is equal to the dimensionality of the low dimensional embedding space. We transform the distribution of the low dimensional data to the reference distribution using QQE. Again, the distribution transformation can be either to the exact desired distribution or to its shape.
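A sketch of the pipeline of this subsection: initialize with any dimensionality reduction method, then transform the embedding's distribution. Here `qqe_transform` is a hypothetical stand-in for the full matching and optimization of Sections II-C and III-A/III-B.

```python
import numpy as np
from sklearn.decomposition import PCA

def embed_with_distribution(X_high, reference_sampler, n_components=2):
    """Initialize with PCA, then reshape the embedding's distribution."""
    X_low = PCA(n_components=n_components).fit_transform(X_high)
    Y_ref = reference_sampler(len(X_low), n_components)   # reference sample
    # qqe_transform (hypothetical) would run the matching of Section II-C
    # and the optimization of Section III-A (exact) or III-B (shape):
    return qqe_transform(X_low, Y_ref)

# Example reference sampler: uniform distribution on a square.
uniform_square = lambda n, d: np.random.default_rng(0).uniform(-1, 1, (n, d))
```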

III-D Unsupervised and Supervised Embedding

QQE, for both the distribution transformation and manifold embedding tasks, can be used in either an unsupervised or a supervised manner. In the unsupervised manner, the distribution of all the data points is transformed to the desired distribution; in the supervised manner, the data points of each class are transformed to have the desired distribution. Hence, in the supervised case, the user can even choose a different distribution for each class. Note that in both the unsupervised and supervised cases, the distribution transformation can be either to the exact reference distribution or to its shape. In the supervised case of the distribution transformation task, the distribution of every class is transformed; in the manifold embedding task, the distribution of the low dimensional data of every class is transformed, no matter whether the dimensionality reduction method used for initialization is unsupervised or supervised.
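In code, the supervised variant is a per-class loop around the same transformation; a sketch, again with the hypothetical `qqe_transform` interface from above:

```python
import numpy as np

def supervised_qqe(X, labels, references):
    """Transform each class to its own reference sample (Section III-D).

    references: dict mapping a class label to a reference sample of the
    same size and dimensionality as that class (hypothetical interface).
    """
    X_hat = np.empty(X.shape)
    for c in np.unique(labels):
        mask = labels == c
        # per-class call to the (hypothetical) transformation sketched above
        X_hat[mask] = qqe_transform(X[mask], references[c])
    return X_hat
```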

IV Experiments

Figure 1: Distribution transformation of the S-shape and uniform data to each other. The first and second pairs of rows correspond to transformation to the shape of the distribution and to the exact distribution, respectively. The arrows show the direction of gradual changes.
Figure 2: Distribution transformation using a given CDF: (a) the CDF of the reference distribution, (b) the reference data, (c) the Gaussian data, and (d) the transformed data.
Figure 3: Distribution transformation of facial images without eyeglasses to the shape of images with eyeglasses. The arrow shows the direction of gradual changes.
Figure 4: Unsupervised and supervised exact manifold embedding of the synthetic data with different initializations. Transformation to the exact reference distribution is also shown. The initialization of LLE is scaled by a constant to be in the range of the other embeddings.

IV-A Discussion on Impact of Hyperparameters

For all experiments in this article, we fixed the learning rate \eta, the number of neighbors k, and the regularization parameter \lambda. QQE is not yet applicable to out-of-sample data (see Section V), so these parameters cannot be determined by validation; however, here we briefly discuss their impact. The learning rate \eta should be set small enough to make progress in the optimization without oscillating behaviour; we empirically found a small fixed \eta to work well for different datasets. A larger number of neighbors k results in a slower pace of optimization because of Eqs. (15) and (16); a very small k, however, does not capture the local patterns of data [saul2003think], so a moderate k is fairly proper. The regularization parameter \lambda determines the importance of distance preservation compared to matching the quantile-quantile plot of distributions. The larger this parameter gets, the less important the distribution transformation becomes compared to preserving distances; hence, the slower the progress of the optimization. A moderate \lambda was empirically found to be proper for different datasets.

IV-B Distribution Transformation for Synthetic Data

To visually show how distribution transformation works, we report the results of QQE on some synthetic datasets. In the following, we cover several different possible cases for distribution transformation.

Figure 5: Unsupervised and supervised exact manifold embedding of the image data with different initializations. Transformation to the exact reference distribution is also shown. The initialization of LLE is scaled by a constant to be in the range of the other embeddings.
Figure 6: Some iterations of unsupervised manifold embedding initialized by PCA and t-SNE. The arrow shows the direction of gradual changes.
Figure 7: Separation and discrimination of classes in synthetic and image data. The arrow shows the direction of gradual changes.

IV-B1 Standard Reference Distributions

A simple option for the reference distribution is a standard probability distribution. As an example, we drew a sample from the two-dimensional uniform distribution, with the same range in both dimensions. This sample is depicted at the right-hand side of Fig. 1. We also created an S-shape dataset, with mean zero and scale three, illustrated at the left-hand side of Fig. 1. As this figure shows, in transforming the S-shape data to the shape of the uniform distribution, the dataset gradually expands to fill the gaps and become similar to uniform without changing its mean and scale. In transforming to the exact uniform distribution, however, the mean and scale of the data change gradually, by translation and contraction, to match the moments of the reference distribution.

IV-B2 Given Reference Sample

We can have a reference sample whose distribution we want the data to have. An example is the S-shape data shown in Fig. 1, to whose distribution we transform the uniform data. In shape transformation, two gaps appear first to imitate the S shape, and then the stems become narrower iteratively. In exact transformation, however, the mean and scale of the data also change. Note that exact transformation is harder than shape transformation because the moments change; thus, some points jump at initial iterations and then converge gradually. In Section V, we mention a future-work direction to make QQE more robust to these jumps.

IV-B3 Given Cumulative Distribution Function

Instead of a standard reference distribution or a reference sample, the user can provide a desired CDF for the data distribution to have. The reference sample can then be drawn using the inverse CDF. The CDF can be multivariate; however, for the sake of visualization, Fig. 2-a shows an example multi-modal univariate CDF. We used this CDF for the first dimension of the reference sample and a uniform distribution for the second dimension, shown in Fig. 2-b. QQE was applied to the Gaussian data shown in Fig. 2-c, and their distribution changed to have a CDF similar to the reference CDF (see Fig. 2-d).
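Drawing the reference sample from a user-given CDF amounts to inverse-CDF sampling; a univariate sketch assuming the CDF is provided on a grid (the bimodal example below is ours, loosely mirroring Fig. 2-a):

```python
import numpy as np

def sample_from_cdf(cdf_values, grid, n, rng=None):
    """Draw n samples from a univariate CDF given on a grid,
    via the inverse-CDF method: x = F^{-1}(u), u ~ Uniform(0, 1)."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(size=n)
    # np.interp inverts the monotone CDF by swapping its axes
    return np.interp(u, cdf_values, grid)

# Example: a bimodal CDF built from two Gaussian bumps.
grid = np.linspace(-6, 6, 1001)
pdf = np.exp(-(grid + 2) ** 2) + np.exp(-(grid - 2) ** 2)
cdf = np.cumsum(pdf)
cdf /= cdf[-1]
x = sample_from_cdf(cdf, grid, n=1000)
```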

IV-C Distribution Transformation for Image Data

The distribution transformation can be used for any real data such as images. We divided the ORL facial images [samaria1994parameterisation] into two sets, with and without eyeglasses. The set with eyeglasses was taken as the reference sample, and we transformed the set without glasses to have the shape of the reference distribution. Figure 3 illustrates the gradual change of two example faces from not having eyeglasses to having them. The glasses appear gradually in the eye regions of the faces.

IV-D Manifold Embedding for Synthetic Data

To test QQE for manifold embedding, we created a three-dimensional synthetic dataset with three classes, shown in Fig. 4. Different dimensionality reduction methods, including PCA [ghojogh2019unsupervised], FDA [ghojogh2019fisher], Isomap [tenenbaum2000global], LLE [roweis2000nonlinear], and t-SNE [van2009learning], were used for initialization (see Fig. 4). We used the uniform distribution as the reference and transformed the embedded data in an unsupervised manner. As Fig. 4 shows, the embeddings of the entire dataset have changed to have the shape of the uniform distribution, but the order and adjacency of the classes/points differ according to the initialization method. On the other hand, the supervised QQE has made the shape of the distribution of every class uniform, as depicted in Fig. 4. Finally, supervised transformations of the embedded data to the exact reference distributions, which are uniform distributions with different means, are shown in Fig. 4. In exact transformation, the order of points differs depending on the initialization method, but the data patterns are similar, so we show only one result.

IV-E Image Manifold Embedding

QQE can be used for manifold embedding of real data such as images. For the experiments, we sampled 10000 images from the MNIST digit dataset [lecun1998gradient], with 1000 images per digit. This subsampling is for computational reasons, owing to the time complexity of QQE (see Section V). We used different initialization methods, i.e., PCA [ghojogh2019unsupervised], FDA [ghojogh2019fisher], Isomap [tenenbaum2000global], LLE [roweis2000nonlinear], t-SNE [van2009learning], ResNet-18 features [he2016deep] (with a cross-entropy loss after the embedding layer), and deep triplet Siamese features [schroff2015facenet] (with ResNet-18 as the backbone network). Any embedding space dimensionality can be used, but here, for visualization, we took it to be two.

Figure 5 shows the experiments. For unsupervised QQE, we took a ring stripe, filled circle, uniform (square), Gaussian mixture model, triangle, diamond, and thick square as the reference distributions for the embeddings initialized by PCA, FDA, Isomap, LLE, t-SNE, ResNet, and the Siamese net, respectively. As shown in Fig. 5, the shape of the embedding has changed to the desired shape while the local distances are preserved as much as possible. Figure 6 illustrates some iterations of the changes in the PCA and t-SNE embeddings as examples.

For supervised transformation to the shapes of the reference distributions, we used different distributions to show that QQE can use various references for the different classes. Helix, circle, S-shape, uniform, and Gaussian shapes were used for the digits 0/1, 2/3, 4/5, 6/7, and 8/9, respectively. Figure 5 depicts the supervised transformation to the shapes of distributions. Moreover, we set the means of the reference distributions to lie on a global circular pattern. This resulted in the transformation to the exact reference distributions shown in Fig. 5. The embedded digit images are also shown in this figure.

IV-F QQE for Separation of Classes

QQE can be used for the separation and discrimination of classes, although it does not yet support out-of-sample data (see Section V). For this purpose, reference distributions with far-away means can be chosen, and transformation to the exact distributions is used. Hence, the classes move away from each other to match the first moments of the reference distributions. We tested this for both synthetic and image data. A two-dimensional synthetic dataset with three mixed classes was created, as shown in Fig. 7. The three classes are gradually separated by QQE to match three Gaussian reference distributions with well-separated means.

For the image data, we used the ORL face dataset [samaria1994parameterisation] with two classes of faces, with and without eyeglasses. The distribution transformation was performed in the input (pixel) space. The two-dimensional embeddings, for visualization in Fig. 7, were obtained using Uniform Manifold Approximation and Projection (UMAP) [mcinnes2018umap]. The dataset was standardized, and the reference distributions were set to be two Gaussian distributions with well-separated means. As the figure shows, the two classes are mixed at first but are gradually separated completely by QQE.

V Conclusion and Future Directions

In this paper, we proposed QQE for distribution transformation, manifold embedding, and image embedding. This method can transform data either to the exact reference distribution or to its shape. Both unsupervised and supervised versions of this method were also proposed. The proposed method is based on the quantile-quantile plot, which is usually used in visual statistical tests.

There exist several possible future directions. The first is to improve the time complexity of QQE, which stems from the assignment problem [edmonds1972theoretical] and the optimization steps. Since the assignment step alone is cubic in the sample size, dealing with big data would be a challenge for this initial version. Thus, the immediate future direction for research would be to develop a more sample-efficient approach, including the handling of large datasets. Handling out-of-sample data is another possible future direction. Moreover, QQE uses least squares fitting, which is not very robust. Because of this, especially if the moments of the data and the reference distribution differ significantly and we want to transform to the exact reference distribution, some data points may jump at initial iterations. This results in later convergence of QQE. One may investigate high-breakdown estimators for robust regression [yohai1987high] to make QQE more robust and faster.

Appendix A Proof of Proposition 1

Consider the first part of the cost function, c_1 = \sum_{i=1}^{n} \|\hat{x}_i - y_{\pi(i)}\|_2^2, whose gradient is:

\frac{\partial c_1}{\partial \hat{x}_i} = 2 (\hat{x}_i - y_{\pi(i)}).

Consider the second part of the cost function, c_2 = \frac{1}{\gamma} \sum_{i=1}^{n} \sum_{j \in N_i} (d_{ij} - \hat{d}_{ij})^2 / d_{ij}. By the chain rule, \partial c_2 / \partial \hat{x}_i = \sum_{j \in N_i} (\partial c_2 / \partial \hat{d}_{ij}) (\partial \hat{d}_{ij} / \partial \hat{x}_i). The first derivative is:

\frac{\partial c_2}{\partial \hat{d}_{ij}} = -\frac{2}{\gamma} \frac{d_{ij} - \hat{d}_{ij}}{d_{ij}},

and, using the chain rule on \hat{d}_{ij} = \|\hat{x}_i - \hat{x}_j\|_2, the second derivative is \partial \hat{d}_{ij} / \partial \hat{x}_i = (\hat{x}_i - \hat{x}_j) / \hat{d}_{ij}. We have:

\frac{\partial c_2}{\partial \hat{x}_i} = -\frac{2}{\gamma} \sum_{j \in N_i} \frac{d_{ij} - \hat{d}_{ij}}{d_{ij}\, \hat{d}_{ij}} (\hat{x}_i - \hat{x}_j). (23)

Considering both parts of the cost function, c = c_1 + \lambda c_2, the gradient is as in the proposition. Q.E.D.

Appendix B Proof of Proposition 2

The second derivative is the derivative of the first derivative, i.e., of Eq. (15). Hence, for the k-th component, the first part gives \partial [2 (\hat{x}_{i,k} - y_{\pi(i),k})] / \partial \hat{x}_{i,k} = 2, and differentiating the k-th component of Eq. (23) with respect to \hat{x}_{i,k} gives:

\frac{\partial^2 c_2}{\partial \hat{x}_{i,k}^2} = -\frac{2}{\gamma} \sum_{j \in N_i} \frac{1}{d_{ij}\, \hat{d}_{ij}} \Big[ (d_{ij} - \hat{d}_{ij}) - \frac{(\hat{x}_{i,k} - \hat{x}_{j,k})^2}{\hat{d}_{ij}} \Big( 1 + \frac{d_{ij} - \hat{d}_{ij}}{\hat{d}_{ij}} \Big) \Big].

Putting all parts of the derivative together gives the second derivative in the proposition. Q.E.D.

References