PHom-GeM: Persistent Homology for Generative Models

05/23/2019 · by Jeremy Charlier, et al.

Generative neural network models, including Generative Adversarial Network (GAN) and Auto-Encoders (AE), are among the most popular neural network models to generate adversarial data. The GAN model is composed of a generator that produces synthetic data and of a discriminator that discriminates between the generator's output and the true data. AE consist of an encoder which maps the model distribution to a latent manifold and of a decoder which maps the latent manifold to a reconstructed distribution. However, generative models are known to provoke chaotically scattered reconstructed distribution during their training, and consequently, incomplete generated adversarial distributions. Current distance measures fail to address this problem because they are not able to acknowledge the shape of the data manifold, i.e. its topological features, and the scale at which the manifold should be analyzed. We propose Persistent Homology for Generative Models, PHom-GeM, a new methodology to assess and measure the distribution of a generative model. PHom-GeM minimizes an objective function between the true and the reconstructed distributions and uses persistent homology, the study of the topological features of a space at different spatial resolutions, to compare the nature of the true and the generated distributions. Our experiments underline the potential of persistent homology for Wasserstein GAN in comparison to Wasserstein AE and Variational AE. The experiments are conducted on a real-world data set particularly challenging for traditional distance measures and generative neural network models. PHom-GeM is the first methodology to propose a topological distance measure, the bottleneck distance, for generative models used to compare adversarial samples in the context of credit card transactions.


I Motivation

The field of unsupervised learning has evolved significantly over the past few years thanks to publications on adversarial networks. In [1], Goodfellow et al. introduced a Generative Adversarial Network framework called GAN. It is a class of generative models that play a competitive game between two networks, in which the generator network must compete against an adversary according to a game-theoretic scenario [2]. The generator network produces samples from a noise distribution while its adversary, the discriminator network, tries to distinguish real samples from generated samples, that is, samples inherited from the training data and samples produced by the generator, respectively. Meanwhile, Variational Auto-Encoders (VAE), presented by Kingma et al. in [3], have emerged as a well-established approach for synthetic data generation. Nevertheless, they may generate a poor target distribution because of the KL divergence [2]. We recall that an AE is a neural network trained to copy its input manifold to its output manifold through a hidden layer: the encoder function sends the input space to the hidden space and the decoder function brings the hidden space back to the input space. By applying some of the Optimal Transport (OT) concepts gathered in [4], notably the Wasserstein distance, Arjovsky et al. introduced the Wasserstein GAN (WGAN) in [5]. It mitigates mode collapse, a typical training convergence issue occurring between the generator and the discriminator. Gulrajani et al. further optimized the concept in [6] by proposing a Gradient Penalty for Wasserstein GAN (GP-WGAN), capable of generating adversarial samples of higher quality. Similarly, Tolstikhin et al. in [7] applied the same OT concepts to AE and introduced the Wasserstein AE (WAE), a new type of AE generative model that avoids the use of the KL divergence.

Nonetheless, describing the distribution of a generative model, which involves describing the generated scattered data points [8] based on the distribution of the original manifold, is very difficult using traditional distance measures such as the Fréchet Inception Distance [7]. We highlight the distribution and manifold notations in figure 1 for GAN and in figure 2 for AE. Indeed, traditional distance measures are not able to acknowledge the shapes of the data manifolds nor the scale at which the manifolds should be analyzed. However, persistent homology [9, 10] is specifically designed to highlight the topological features of the data [11]. Therefore, building upon persistent homology, the Wasserstein distance [12] and generative models [7], our main contribution is to propose qualitative and quantitative ways to evaluate the scattered generated distributions and the performance of generative models.


Figure 1: In PHom-GeM applied to GAN, the generative model generates fake samples based on samples from the prior random distribution. Then, the discriminator model tries to differentiate the fake samples from the true samples. The original manifold and the generated manifold are transformed independently into metric space sets to obtain filtered simplicial complexes. This leads to a description of the topological features, summarized by the barcodes, used to compare the respective topological representations of the true data distribution and the generative model distribution.

In this work, we describe the persistent homology features of the generative model while minimizing the OT function for a squared cost, where the model distribution describes the data contained in the original manifold and the generative model distribution is the one capable of generating adversarial samples. Our contributions are summarized below:

  • A persistent homology procedure for generative models, including GP-WGAN, WGAN, WAE and VAE, which we call PHom-GeM, to highlight the topological properties of the generated data distributions at different spatial resolutions. The objective is a persistent homology description of the data distribution generated by the generative model.

  • A distance measure for persistence diagrams, the bottleneck distance, applied to generative models to compare quantitatively the true and the target distributions on any data set. We measure the shortest distance for which there exists a perfect matching between the points of the two persistence diagrams. A persistence diagram is a stable summary representation of the topological features of a simplicial complex, a collection of simplices, associated to the data set.

  • Finally, we propose the first application of algebraic topology and generative models on a public data set containing credit card transactions, particularly challenging for this type of models and traditional distance measures.

The paper is structured as follows. In section II, we review the optimized GP-WGAN and WAE formulations using OT derived by Gulrajani et al. in [6] and Tolstikhin et al. in [7], respectively. By using persistent homology, we are able to compare the topological properties of the original distribution and the generated distribution. We highlight experimental results in section III and we conclude in section IV by addressing promising directions for future work.


Figure 2: In PHom-GeM applied to AE, the generative model, the decoder, is used to generate fake samples based on samples from a prior random distribution. Afterward, the original manifold and the generated manifold are both transformed independently into metric space sets to obtain filtered simplicial complexes. As for PHom-GeM applied to GAN, this leads to a description of the topological features, summarized by the barcodes, used to compare the respective topological representations of the true data distribution and the generative model distribution.

II Proposed Method

Our method computes the persistent homology of both the true manifold and the generated manifold following the generative model, based on the minimization of the optimal transport cost. In the resulting topological problem, the points of the manifolds are transformed into a metric space set to which a Vietoris-Rips simplicial complex filtration is applied (see definition 2). PHom-GeM achieves two main goals simultaneously: it computes the birth-death of the pairing generators of the iterated inclusions while measuring the bottleneck distance between the persistence diagrams of the manifolds of the generative models.

II-A Optimal Transport and Dual Formulation

Following the description of the optimal transport problem [4] and relying on the Kantorovich-Rubinstein duality, the Wasserstein-1 distance is computed as

W_1(P_X, P_G) = \inf_{\Gamma \in \mathcal{P}(P_X, P_G)} \mathbb{E}_{(X,Y)\sim\Gamma}[\, d(X,Y) \,] = \sup_{f \in \mathcal{F}_L} \mathbb{E}_{X\sim P_X}[f(X)] - \mathbb{E}_{Y\sim P_G}[f(Y)]    (1)

where (\mathcal{X}, d) is a metric space, \mathcal{P}(P_X, P_G) is the set of all joint distributions with marginals P_X and P_G respectively, and \mathcal{F}_L is the class of all bounded 1-Lipschitz functions on (\mathcal{X}, d).
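For intuition, in one dimension the Wasserstein-1 distance between two equal-size empirical samples has a closed form: the optimal coupling simply matches the sorted samples. The sketch below is an illustration only, not the paper's implementation:

```python
import numpy as np

def wasserstein1_1d(u, v):
    """Empirical Wasserstein-1 distance between two equal-size 1-D
    samples: the optimal transport plan matches order statistics,
    so W1 is the mean absolute difference of the sorted samples."""
    u, v = np.sort(np.asarray(u, float)), np.sort(np.asarray(v, float))
    return float(np.mean(np.abs(u - v)))

# A sample shifted by 1 is at W1 distance exactly 1 from the original.
print(wasserstein1_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))  # 1.0
```

In higher dimensions no such closed form exists, which is precisely why the dual formulation of equation (1) is estimated with a critic network in WGAN.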

II-B Gradient Penalty Wasserstein GAN (GP-WGAN)

As described in [6], the GP-WGAN objective loss function with gradient penalty is expressed such that

L = \mathbb{E}_{\tilde{x}\sim P_G}[D(\tilde{x})] - \mathbb{E}_{x\sim P_X}[D(x)] + \lambda\, \mathbb{E}_{\hat{x}\sim P_{\hat{x}}}\big[ (\| \nabla_{\hat{x}} D(\hat{x}) \|_2 - 1)^2 \big]    (2)

where D belongs to the set of 1-Lipschitz functions on \mathcal{X}, P_X is the original data distribution and P_G the generative model distribution implicitly defined by \tilde{x} = G(z). The input z to the generator is sampled from a noise distribution p(z) such as a uniform distribution. P_{\hat{x}} defines the uniform sampling along straight lines between pairs of points sampled from the data distribution P_X and the generative distribution P_G. A penalty on the gradient norm is enforced for random samples \hat{x} \sim P_{\hat{x}}. For further details, we refer to [6] and [5].
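The straight-line sampling of the interpolates can be sketched as below. The example is a toy illustration, not the paper's training code: it uses a hypothetical linear critic, whose gradient is constant, so the penalty term has a closed form.

```python
import numpy as np

def sample_interpolates(x_real, x_fake, rng):
    """Uniform sampling along straight lines between paired real and
    generated points, as used by the gradient penalty term."""
    eps = rng.uniform(size=(x_real.shape[0], 1))  # one epsilon per pair
    return eps * x_real + (1.0 - eps) * x_fake

def gradient_penalty_linear_critic(w, lam=10.0):
    """For an illustrative linear critic D(x) = w @ x, the gradient with
    respect to any input is w itself, so the penalty collapses to
    lam * (||w||_2 - 1)^2 at every interpolate."""
    return lam * (np.linalg.norm(w) - 1.0) ** 2

rng = np.random.default_rng(0)
x_hat = sample_interpolates(np.zeros((4, 3)), np.ones((4, 3)), rng)
print(x_hat.shape)                                                # (4, 3)
print(gradient_penalty_linear_critic(np.array([1.0, 0.0, 0.0])))  # 0.0
```

A unit-norm linear critic is exactly 1-Lipschitz, hence incurs zero penalty; in practice the gradient at each x̂ is obtained by automatic differentiation through the critic network.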

II-C Wasserstein Auto-Encoders

As described in [7], the WAE objective function is expressed such that

D_{WAE}(P_X, P_G) = \inf_{Q(Z|X)\in\mathcal{Q}} \mathbb{E}_{P_X}\, \mathbb{E}_{Q(Z|X)}\big[ c(X, G(Z)) \big] + \lambda\, D_Z(Q_Z, P_Z)    (3)

where c : \mathcal{X}\times\mathcal{X} \to \mathbb{R}_+ is any measurable cost function. In our experiments, we use a square cost function c(x, y) = \|x - y\|_2^2 for data points x, y \in \mathcal{X}. G(Z) denotes the sending of Z to X for a given map G : \mathcal{Z} \to \mathcal{X}. \mathcal{Q} and \mathcal{G} are any nonparametric sets of probabilistic encoders and decoders, respectively.

We use the Maximum Mean Discrepancy (MMD) for the penalty D_Z(Q_Z, P_Z), for a positive-definite reproducing kernel k : \mathcal{Z}\times\mathcal{Z} \to \mathbb{R}:

MMD_k(Q_Z, P_Z) = \Big\| \int_{\mathcal{Z}} k(z, \cdot)\, dQ_Z(z) - \int_{\mathcal{Z}} k(z, \cdot)\, dP_Z(z) \Big\|_{\mathcal{H}_k}    (4)

where \mathcal{H}_k is the reproducing kernel Hilbert space of real-valued functions mapping \mathcal{Z} on \mathbb{R}. For details on the MMD implementation, we refer to [7].
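In practice the integrals of equation (4) are replaced by sample averages of kernel evaluations. A minimal sketch of the resulting biased (V-statistic) estimate of the squared MMD with an RBF kernel, purely illustrative and not the estimator variant used in [7]:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Positive-definite RBF kernel matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * sq)

def mmd_squared(x, z, gamma=1.0):
    """Biased (V-statistic) sample estimate of MMD^2: the squared RKHS
    norm of the difference between the two empirical mean embeddings,
    expanded into three kernel averages."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(z, z, gamma).mean()
            - 2.0 * rbf_kernel(x, z, gamma).mean())

x = np.array([[0.0], [1.0], [2.0]])
z = np.array([[10.0], [11.0], [12.0]])
print(mmd_squared(x, x) < 1e-9, mmd_squared(x, z) > 0.1)  # True True
```

The estimate vanishes when the two samples coincide and grows as the encoded distribution Q_Z drifts away from the prior P_Z, which is what makes it usable as the WAE penalty.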

II-D Persistence Diagram and Vietoris-Rips Complex

Definition 1. Let V be a set of vertices. A simplex \sigma is a subset of vertices \sigma \subseteq V. A simplicial complex K on V is a collection of simplices \sigma \subseteq V such that every subset of a simplex of K is itself a simplex of K, i.e. \tau \subseteq \sigma \in K \Rightarrow \tau \in K. The dimension of \sigma is its number of elements minus 1. Examples of simplicial complexes are represented in figure 3.


Figure 3: A simplicial complex is a collection of simplices, where a 0-simplex is a point, a 1-simplex is a line segment, a 2-simplex is a triangle and a 3-simplex is a tetrahedron.

Definition 2. Let (X, d) be a metric space. The Vietoris-Rips complex VR_\varepsilon(X) at scale \varepsilon associated to X is the abstract simplicial complex whose vertex set is X, and where \{x_0, x_1, \ldots, x_k\} is a k-simplex if and only if d(x_i, x_j) \leq \varepsilon for all i, j.
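Definition 2 translates directly into code: a subset of points spans a simplex exactly when all its pairwise distances are at most ε. The brute-force sketch below is illustrative only; practical libraries such as GUDHI or Ripser use far more efficient constructions.

```python
import numpy as np
from itertools import combinations

def vietoris_rips(points, eps, max_dim=2):
    """All simplices of the Vietoris-Rips complex VR_eps up to dimension
    max_dim: a set of k+1 points spans a k-simplex iff every pairwise
    distance is at most eps."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    simplices = [(i,) for i in range(n)]  # 0-simplices: the vertices
    for k in range(2, max_dim + 2):       # k vertices -> a (k-1)-simplex
        for combo in combinations(range(n), k):
            if all(dist[i, j] <= eps for i, j in combinations(combo, 2)):
                simplices.append(combo)
    return simplices

# Three collinear points: two close together and one far apart.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
print(vietoris_rips(pts, eps=1.5))
# [(0,), (1,), (2,), (0, 1)] -- one edge, no triangle
```

Sweeping eps upward and recording how components and loops appear and disappear yields exactly the filtration of equation (5).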

We obtain an increasing sequence of Vietoris-Rips complexes by considering VR_\varepsilon(X) for an increasing sequence \varepsilon_1 \leq \varepsilon_2 \leq \ldots \leq \varepsilon_m of values of the scale parameter:

VR_{\varepsilon_1}(X) \subseteq VR_{\varepsilon_2}(X) \subseteq \ldots \subseteq VR_{\varepsilon_m}(X)    (5)

Applying the k-th singular homology functor with coefficients in a field \mathbb{F} [13] to (5), we obtain a sequence of \mathbb{F}-vector spaces, called the k-th persistence module of X:

H_k(VR_{\varepsilon_1}(X)) \to H_k(VR_{\varepsilon_2}(X)) \to \ldots \to H_k(VR_{\varepsilon_m}(X))    (6)

Definition 3. For \varepsilon_i \leq \varepsilon_j, the (i,j)-persistent k-homology group with coefficients in \mathbb{F} of X, denoted H_k^{i \to j}(X), is defined to be the image of the homomorphism H_k(VR_{\varepsilon_i}(X)) \to H_k(VR_{\varepsilon_j}(X)).

Using the interval decomposition theorem [14], we extract a finite family of intervals [b, d) called the persistence diagram. Each interval can be considered as a point (b, d) in the set \Delta = \{(b, d) \in \mathbb{R}^2 : b \leq d\}. Hence, we obtain a finite subset of the set \Delta. This space of finite subsets is endowed with a matching distance called the bottleneck distance, defined as follows:

d_B(D_1, D_2) = \inf_{\gamma} \sup_{p \in D_1} \| p - \gamma(p) \|_\infty

where p \in D_1, \gamma(p) \in D_2, and the infimum is over all the bijections \gamma from D_1 to D_2, both diagrams being augmented with points of the diagonal so that such bijections always exist.
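For tiny diagrams, the bottleneck distance can be computed by brute force directly from the definition: augment each diagram with the diagonal projections of the other's points, then minimize over perfect matchings, with diagonal-to-diagonal matches costing zero. This sketch (factorial complexity, illustrative only; practical libraries use bipartite matching algorithms instead) assumes diagrams given as lists of (birth, death) pairs:

```python
from itertools import permutations

def bottleneck(d1, d2):
    """Brute-force bottleneck distance between two small persistence
    diagrams. Each diagram is augmented with the diagonal projections
    of the other diagram's points; matching two diagonal points is free."""
    def diag(p):  # nearest diagonal point in the L-infinity sense
        m = (p[0] + p[1]) / 2.0
        return (m, m)
    a = [(p, False) for p in d1] + [(diag(q), True) for q in d2]
    b = [(q, False) for q in d2] + [(diag(p), True) for p in d1]
    def cost(u, v):
        (p, p_diag), (q, q_diag) = u, v
        if p_diag and q_diag:
            return 0.0
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))
    return min(max(cost(a[i], b[j]) for i, j in enumerate(perm))
               for perm in permutations(range(len(b))))

# Shifting the death time of a single feature by 0.5 moves the
# diagram by exactly 0.5 in bottleneck distance.
print(bottleneck([(0.0, 2.0)], [(0.0, 2.5)]))  # 0.5
```

Short-lived features near the diagonal are cheaply matched to it, which is why the bottleneck distance is robust to topological noise.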

II-E Application: PHom-GeM, Persistent Homology for Generative Models

Bridging the gap between persistent homology and generative models, PHom-GeM uses a two-step procedure. First, the minimization problem is solved for the generator and the discriminator when considering GP-WGAN and WGAN. The gradient penalty coefficient λ in equation (2) is fixed equal to 10 for GP-WGAN and to 0 for WGAN. For auto-encoders, the minimization problem is solved for the encoder and the decoder. We use the RMSProp optimizer [15] for the optimization procedure. Second, the samples of the original and generated distributions are mapped to persistent homology for the description of their respective manifolds. The points contained in the original manifold and the points contained in the generated manifold are randomly selected into respective batches. One sample from each distribution is selected to differentiate the topological features of the original manifold and the generated manifold. The two sample spaces are then transformed into metric space sets for computational purposes, and the metric space sets are filtered using the Vietoris-Rips simplicial complex filtration: given a scale parameter ε, an edge is created between every pair of data points separated by a distance smaller than ε. This leads to the construction of a collection of simplices resulting in the Vietoris-Rips simplicial complex filtration. In our case, we use the Vietoris-Rips simplicial complex as it offers the best compromise between filtration accuracy and memory requirement [11]. Subsequently, the persistence diagrams are constructed. We recall that a persistence diagram is a stable summary representation of the topological features of a simplicial complex. The persistence diagrams allow the computation of the bottleneck distance. Finally, the barcodes represent in a simple way the birth-death of the pairing generators of the iterated inclusions detected by the persistence diagrams.

Data: training and validation sets, hyperparameter λ
Result: persistent homology description of generative manifolds
1  begin
2      /* Step 1: Generative Models Resolution */
3      Select samples from the training set
4      Select samples from the validation set
5      With RMSProp gradient descent updates, optimize until convergence:
6          case GP-WGAN and WGAN: using equation (2)
7          case WAE: using equation (3)
8          case VAE: using the equation described in [3]
9      /* Step 2: Persistence Diagram and Bottleneck Distance on the Manifolds of the Generative Models */
10     Randomly select samples from the original and the generated distributions
11     Transform the sample spaces into metric space sets
12     Filter the metric space sets with a Vietoris-Rips simplicial complex
13     Compute the persistence diagrams
14     Evaluate the bottleneck distance
15     Build the barcodes from the persistence diagrams
16     return
Algorithm 1 Persistent Homology for Generative Models
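Step 2 of the algorithm can be made concrete for the 0-dimensional case: in a Vietoris-Rips filtration every connected component is born at scale 0 and dies at the scale where it merges with another component, which happens exactly at the lengths of the minimum-spanning-tree edges. A small union-find sketch (an illustration, not the paper's implementation):

```python
import numpy as np

def h0_barcodes(points):
    """0-dimensional persistence barcodes of a Vietoris-Rips filtration:
    process edges by increasing length (Kruskal-style); each union of two
    components kills one bar at that edge length, and one essential bar
    survives for the final component."""
    n = len(points)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    edges = sorted((np.linalg.norm(points[i] - points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            bars.append((0.0, float(d)))  # a component dies at scale d
    bars.append((0.0, float("inf")))      # the surviving component
    return bars

print(h0_barcodes(np.array([[0.0], [1.0], [3.0]])))
# [(0.0, 1.0), (0.0, 2.0), (0.0, inf)]
```

Higher-dimensional groups such as H1 require boundary-matrix reduction and are computed with a dedicated library in practice.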

III Experiments

We empirically evaluate the proposed methodology PHom-GeM. We assess, on a data set highly challenging for generative models, whether PHom-GeM can simultaneously achieve (i) a precise persistent homology mapping of the generated data points and (ii) an accurate persistent homology distance measurement with the bottleneck distance.

Data Availability and Data Description. We train PHom-GeM on one real-world open data set: the credit card transactions data set from the Kaggle database¹ containing 284 807 transactions including 492 frauds. This data set is particularly interesting because it reflects the scattered point distributions of the reconstructed manifold that are observed during generative models' training, which afterward impact the generated adversarial samples. Furthermore, this data set is challenging because of the strong imbalance between normal and fraudulent transactions, while being of high interest for the banking industry. To preserve transaction confidentiality, each transaction is composed of 28 components obtained with PCA, without any description, and two additional features, Time and Amount, that remained unchanged. Each transaction is labeled as fraudulent or normal in a feature called Class, which takes a value of 1 or 0, respectively.

¹The data set is available at https://www.kaggle.com/mlg-ulb/creditcardfraud.

Experimental Setup and Code Availability. In our experiments, we use the Euclidean latent space and the square cost function previously defined, c(x, y) = \|x - y\|_2^2, for data points x, y. We kept the 28 components obtained with PCA and the amount, resulting in a space of dimension 29. For the error minimization process, we used RMSProp gradient descent [15] with a batch size of 64. Different values of the coefficient λ for the gradient penalty have been tested; we empirically obtained the lowest reconstruction error with λ = 10 for both GP-WGAN and WAE. The coefficients of persistent homology are evaluated within a finite field. We only consider the homology groups H_0 and H_1, which represent the connected components and the loops, respectively. Higher-dimensional homology groups did not noticeably improve the quality of the results while leading to longer computational times. The simulations were performed on a computer with 16GB of RAM, an Intel i7 CPU and a Tesla K80 GPU accelerator. To ensure the reproducibility of the experiments, the code is available at https://github.com/dagrate/phomgem.
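For reference, the RMSProp update used above keeps a running average of squared gradients and divides each step by its root. A toy sketch on a simple quadratic, with assumed hyperparameters rather than the paper's settings:

```python
import numpy as np

def rmsprop_minimize(grad, x0, lr=0.01, rho=0.9, eps=1e-8, steps=2000):
    """Plain RMSProp: maintain an exponential moving average of squared
    gradients and scale each gradient step by its square root."""
    x = np.asarray(x0, dtype=float).copy()
    avg = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x)
        avg = rho * avg + (1.0 - rho) * g * g
        x = x - lr * g / (np.sqrt(avg) + eps)
    return x

# Minimize f(x) = ||x||^2, whose gradient is 2x; the iterates settle
# in a small band around the origin.
x_star = rmsprop_minimize(lambda x: 2.0 * x, [1.0, -2.0])
print(np.linalg.norm(x_star) < 0.2)  # True
```

The per-coordinate normalization makes the effective step size roughly lr regardless of the gradient scale, which is convenient when the generator and the critic produce gradients of very different magnitudes.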

Results and Discussions about PHom-GeM

We test PHom-GeM, Persistent Homology for Generative Models, on four different generative models: GP-WGAN, WGAN, WAE and VAE. We compare the performance of PHom-GeM on two aspects: first, qualitative visualization of the persistence diagrams and barcodes; secondly, quantitative estimation of the persistent homology closeness, using the bottleneck distance between the generated manifolds of the generative models and the original manifold.

At the top of figure 4, the rotated persistence diagram and the barcode diagram of the original sample are shown. In the persistence diagram, black points represent the 0-dimensional homology group H_0, the connected components of the complex. The red triangles represent the 1-dimensional homology group H_1, the 1-dimensional features known as cycles or loops. The barcode diagram is a simple way of representing the information contained in the persistence diagram. For the sake of simplicity, we represent only the barcode diagrams of the generative models to compare qualitatively the generated distribution of each model with respect to the distribution of the original sample. The generated distribution of GP-WGAN is the closest to the original distribution, followed by WGAN, WAE and VAE. Indeed, the spectrum of the barcodes of GP-WGAN is very similar to the original sample's spectrum, as well as denser on the right. By contrast, the WAE and VAE distributions are not able to reproduce all of the features contained in the original distribution, which explains their narrower barcode spectra.

[Figure 4 panels: persistence diagram and barcodes of the original sample; PHom barcodes for GP-WGAN, WGAN, WAE and VAE]

Figure 4: On top, the rotated persistence diagram and the barcode diagram of the original sample are represented. They both illustrate the birth-death of the pairing generators of the iterated inclusions. In the persistence diagram, the black points represent the connected components of the complex and the red triangles the cycles. Below, the barcode diagrams of GP-WGAN, WGAN, WAE and VAE allow us to assess qualitatively, against the original sample's barcode diagram, the persistent homology similarities between the generated and the original distributions.

In order to quantitatively assess the quality of the generated distributions, we use the bottleneck distance between the persistence diagram of the original sample and the persistence diagram of the generated data points. In table I, we highlight the mean value of the bottleneck distance for a 95% confidence interval. We also report the lower and the upper bounds of the 95% confidence interval for each generative model. Confirming the visual observations, we notice that the smallest bottleneck distance, and therefore the best result, is obtained with GP-WGAN, followed by WGAN, WAE and VAE. It means GP-WGAN is capable of generating a data distribution sharing the most topological features with the original data distribution, including the nearness measurements and the overall shape. It confirms topologically, on a real-world data set, the claims made in [6] of the superior performance of GP-WGAN over WGAN. Furthermore, the performance of the AEs cannot match the generative performance achieved by the GANs. However, WAE, which relies on optimal transport theory, achieves a better generative distribution in comparison to the popular VAE.

Gen. Model Mean Value Lower Bound Upper Bound
GP-WGAN 0.0711 0.0683 0.0738
WGAN 0.0744 0.0716 0.0772
WAE 0.0821 0.0791 0.0852
VAE 0.0857 0.0833 0.0881
Table I: Bottleneck distance (smaller is better) with 95% confidence intervals between the samples of the original manifold and the generated samples. Thanks to the Wasserstein distance and the gradient penalty, GP-WGAN achieves the best performance.

IV Conclusion

Building upon optimal transport and unsupervised learning, we introduced PHom-GeM, Persistent Homology for Generative Models, a new characterization of generative manifolds that uses topology and persistent homology to highlight manifold features and scattered generated distributions. We discussed the relations of GP-WGAN, WGAN, WAE and VAE in the context of unsupervised learning. Furthermore, relying on persistent homology, the bottleneck distance has been introduced to estimate quantitatively the similarity of topological features between the original distribution and the generated distributions of the generative models, a specificity that current traditional distance measures fail to acknowledge. We conducted experiments showing the performance of PHom-GeM on the four generative models GP-WGAN, WGAN, WAE and VAE. We used a challenging, imbalanced, real-world open data set containing credit card transactions, capable of illustrating the scattered generated data distributions of the generative models and particularly relevant for the banking industry. We showed the superior topological performance of GP-WGAN in comparison to the other generative models, as well as the superior performance of WAE over VAE. Future work will include further exploration of the topological features, such as the influence of the simplicial complex, and the possibility of integrating a topological optimization function as a regularization term.

References