Class-Conditional VAE-GAN for Local-Ancestry Simulation

11/27/2019 ∙ by Daniel Mas Montserrat, et al. ∙ 0

Local ancestry inference (LAI) allows identification of the ancestry of all chromosomal segments in admixed individuals, and it is a critical step in the analysis of human genomes with applications from pharmacogenomics and precision medicine to genome-wide association studies. In recent years, many LAI techniques have been developed in both industry and academic research. However, these methods require large training data sets of human genomic sequences from the ancestries of interest. Such reference data sets are usually limited, proprietary, protected by privacy restrictions, or otherwise not accessible to the public. Techniques to generate training samples that resemble real haploid sequences from ancestries of interest can be useful tools in such scenarios, since a generalized model can often be shared, but the unique human sample sequences cannot. In this work we present a class-conditional VAE-GAN to generate new human genomic sequences that can be used to train local ancestry inference (LAI) algorithms. We evaluate the quality of our generated data by comparing the performance of a state-of-the-art LAI method when trained with generated versus real data.






1 Introduction

Human populations all share a common ancient origin in Africa DeGiorgio et al. (2009) and a common set of variable sites, but correlations between neighboring sites along the genome, which are typically inherited together, vary between sub-populations around the globe Li et al. (2008). These correlations along the genome, known as linkage, influence polygenic risk scores (PRS) Duncan et al. (2019), genome-wide association study (GWAS) results Martin et al. (2017), and many other features of precision medicine. Unfortunately, large portions of the world’s populations have not been included in modern genetic research studies, with over 80% of these studies to date including only individuals of European ancestry Popejoy and Fullerton (2016). This has serious adverse consequences for the ability of associations learned in these modern studies to be applied to the rest of the world Duncan et al. (2019). Deconvolving the ancestry of admixed individuals using local-ancestry inference can contribute to filling this gap and to understanding the genetic architecture and associations of non-European ancestries, thus allowing the benefits of genomic medicine to accrue to a larger portion of the planet’s population.

Many open-source methods for local-ancestry inference exist. HAPAA Sundquist et al. (2008), HAPMIX Price et al. (2009) and SABER Tang et al. (2006) infer local ancestry using Hidden Markov Models (HMMs), LAMP Sankararaman et al. (2008) uses probability maximization with a sliding window, and RFMix Maples et al. (2013) uses random forests within windows. However, these algorithms all require accessible training data from relevant ancestries in order to recognize those ancestry segments.

The challenge is that many data sets containing human genomic references are proprietary Han et al. (2017); Bryc et al. (2015), protected by privacy restrictions Wojcik et al. (2019), or are otherwise not accessible to the public, especially data sets for under-served or sensitive populations. Generative models that can be easily shared once trained can be useful in such scenarios. While the data sets with their de-anonymizable genome-wide sequences remain securely private, models trained on them could be made publicly available.

In recent years, deep learning has proven effective in solving computer vision and natural language processing problems LeCun et al. (2015). These methods are being used in the biology, medical and genomics fields Gareau et al. (2017); Lundervold and Lundervold (2019); Li et al. (2018); Eraslan et al. (2019). Specifically, deep learning-based generative methods have become increasingly popular. Generative networks such as Variational Autoencoders (VAEs) Kingma and Welling (2014) contain an encoder that maps the input data into a lower-dimensional space and a decoder that tries to reconstruct the original input. Generative Adversarial Networks (GANs) Goodfellow et al. (2014) have been able to generate samples that resemble the training data. GANs generate realistic data by using two competing networks: a generator that aims to create realistic new samples and a discriminator that classifies samples as real or generated. Many variants and extensions of GANs and VAEs have been presented recently Arjovsky et al. (2017); Brock et al. (2019); Sohn et al. (2015).

In this work, we present a class-conditional Variational Autoencoder and Generative Adversarial Network (VAE-GAN) for human genome sequence simulation. The network combines a class-conditional VAE, shown in figure 1, with a class-conditional GAN, shown in figure 2. The network is able to simulate new single-ancestry sequences that resemble the sequences from the training set. The generated sequences are used to train RFMix.

Figure 1: The class-conditional VAE is composed of an encoder-decoder pair. The encoder transforms the input sequence $x$ from the ancestry $y$ into an embedded representation $z$. The decoder transforms the embedding $z$ and ancestry $y$ into a reconstruction of the input sequence, $\hat{x}$.

2 Out-of-Africa Dataset

In this work we use simulated datasets with ancestry data generated from out-of-Africa simulations using msprime Kelleher et al. (2016). This simulation models the origin and spread of humans as a single ancestral population that grew instantaneously into the continent of Africa. This population remained at a constant size to the present day. At some point in the past, a small group of individuals migrated out of Africa and later split in two directions: some founding the present-day European populations, and others founding the present-day East Asian populations. Both populations grew exponentially after their separation. The parameters that determine the timing of these events, the effective population sizes, and the growth rates of the European and East Asian populations are presented in Gravel et al. (2011).

Following the above out-of-Africa model, we generated three groups of 100 single-ancestry diploid individuals, one group each of African, European and East Asian ancestry. We divided these 300 simulated individuals into training, validation and testing sets with 240, 30 and 30 diploid individuals respectively. The validation and testing individuals were then used to generate admixed descendants using Wright-Fisher forward simulation over a series of generations. From each set of 30 single-ancestry individuals, a total of 100 admixed individuals were generated, with the admixture event occurring 8 generations in their past, to create the validation and testing sets. The 240 single-ancestry individuals were used to train RFMix and the class-conditional VAE-GAN, and the 200 admixed individuals of the validation and testing sets were used to evaluate RFMix following training. Throughout, we use chromosome 20 of each individual for experiments.
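To make the admixture step concrete, the following is a minimal sketch of a Wright-Fisher forward simulation that tracks a per-SNP ancestry label alongside each haplotype. It is not the simulator used in this work: the function names, the single crossover per gamete, random mating, and the toy founder data are all assumptions made for illustration.

```python
import numpy as np

def make_gamete(hap_a, hap_b, anc_a, anc_b, rng):
    """Recombine a parent's two haplotypes at one random crossover point (assumed model)."""
    n_snps = hap_a.shape[0]
    cut = rng.integers(1, n_snps)            # crossover position in [1, n_snps - 1]
    if rng.random() < 0.5:                   # choose which haplotype starts the gamete
        hap_a, hap_b, anc_a, anc_b = hap_b, hap_a, anc_b, anc_a
    hap = np.concatenate([hap_a[:cut], hap_b[cut:]])
    anc = np.concatenate([anc_a[:cut], anc_b[cut:]])
    return hap, anc

def wright_fisher_admix(haps, ancs, n_generations, n_offspring, seed=0):
    """Forward-simulate admixture. haps and ancs have shape (n_individuals, 2, n_snps)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_generations):
        new_haps = np.empty((n_offspring, 2, haps.shape[2]), dtype=haps.dtype)
        new_ancs = np.empty((n_offspring, 2, ancs.shape[2]), dtype=ancs.dtype)
        for child in range(n_offspring):
            # Two distinct parents drawn uniformly from the previous generation.
            mother, father = rng.choice(haps.shape[0], size=2, replace=False)
            for slot, parent in enumerate((mother, father)):
                hap, anc = make_gamete(haps[parent, 0], haps[parent, 1],
                                       ancs[parent, 0], ancs[parent, 1], rng)
                new_haps[child, slot] = hap
                new_ancs[child, slot] = anc
        haps, ancs = new_haps, new_ancs
    return haps, ancs

# Toy founders: 30 single-ancestry diploids (10 per ancestry), random 0/1 alleles.
n_founders, n_snps = 30, 200
rng = np.random.default_rng(1)
founder_haps = rng.integers(0, 2, size=(n_founders, 2, n_snps))
founder_labels = np.repeat(np.arange(3), n_founders // 3)
founder_ancs = np.broadcast_to(
    founder_labels[:, None, None], (n_founders, 2, n_snps)
).copy()

admixed_haps, admixed_ancs = wright_fisher_admix(
    founder_haps, founder_ancs, n_generations=8, n_offspring=100
)
```

The `admixed_ancs` array plays the role of the ground-truth local-ancestry labels against which an LAI method such as RFMix would be scored.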

3 Network Architecture

Figure 2: The class-conditional GAN is composed of a decoder-discriminator pair. The decoder generates new samples $\tilde{x}$ from a Gaussian representation $z$ and ancestry $y$. The discriminator separates between out-of-Africa sequences $x$ and VAE-GAN generated sequences $\tilde{x}$.

The proposed network splits the genome into fixed-size non-overlapping windows. The SNPs within each window are used as the input for an individual class-conditional VAE-GAN. The input SNPs are encoded as -1 and 1 for each base-pair. Missing input SNPs are modeled by inputting a 0 in the corresponding position. Each VAE-GAN is composed of three sub-networks: an encoder, a decoder, and a discriminator. Each sub-network is class-conditional (i.e., the ancestry is an additional input of the network). The encoder-decoder pair forms a VAE (figure 1) while the decoder-discriminator pair forms a GAN (figure 2).
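The input preparation described above can be sketched in a few lines of NumPy. The helper names, the use of -1 to mark a missing call in the raw data, and the choice to drop SNPs that do not fill a final window are assumptions for illustration; the text does not specify them.

```python
import numpy as np

def encode_snps(alleles):
    """Map 0/1 alleles to -1/+1 and missing calls (negative values, assumed) to 0."""
    alleles = np.asarray(alleles)
    encoded = np.where(alleles == 1, 1.0, -1.0)
    encoded[alleles < 0] = 0.0
    return encoded

def split_windows(sequence, window_size):
    """Split a 1-D sequence into non-overlapping windows, dropping any ragged tail."""
    n_windows = sequence.shape[0] // window_size
    return sequence[: n_windows * window_size].reshape(n_windows, window_size)

def one_hot(ancestry, n_ancestries):
    """One-hot encode an integer ancestry label."""
    vec = np.zeros(n_ancestries)
    vec[ancestry] = 1.0
    return vec

haplotype = np.array([0, 1, 1, -1, 0, 0, 1, 1, 0, 1])   # -1 marks a missing SNP
windows = split_windows(encode_snps(haplotype), window_size=4)

# Each window is paired with the one-hot ancestry to form a class-conditional input.
ancestry = one_hot(1, n_ancestries=3)
inputs = np.hstack([windows, np.tile(ancestry, (windows.shape[0], 1))])
```

Here each row of `inputs` would be fed to the VAE-GAN responsible for that window.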

The encoder, $E$, transforms the input SNPs $x$ from the given ancestry $y$ (represented with one-hot encoding) into an isotropic Gaussian embedding space $z$. The network encodes the input sequence to the embedding space by estimating $\mu$ and $\log \sigma^2$. The variance is estimated in a logarithmic form to force $\sigma^2 > 0$. The embedded representation $z$ of a sample $x$ from an ancestry $y$ can be sampled from $\mathcal{N}(\mu, \Sigma)$. The sampling can be performed with the reparametrization trick: $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $\odot$ is an element-wise multiplication. The encoder networks begin with an input linear layer of size $(W + A) \times H$, where $W$ is the window’s size, $A$ is the number of ancestries, and $H$ is the size of the hidden layer. Following the first layer, a ReLU non-linearity and batch normalization are used. Then, two linear layers are used with dimensions $H \times J$, where $J$ is the dimension of the embedding space, to estimate $\mu$ and $\log \sigma^2$.
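As an illustration of the encoder and the reparametrization trick, here is a forward-pass-only sketch in NumPy. It is not the authors' implementation: the class name, the random weight initialization, the assumed input size (window size plus number of ancestries), and the batch normalization without learned scale and shift are all simplifications chosen for this example.

```python
import numpy as np

class Encoder:
    """Linear -> ReLU -> batch norm -> two linear heads (mean and log-variance)."""

    def __init__(self, window_size, n_ancestries, hidden_size, latent_size, seed=0):
        rng = np.random.default_rng(seed)
        in_size = window_size + n_ancestries      # SNPs concatenated with one-hot ancestry
        self.w_in = rng.normal(0.0, 0.1, (in_size, hidden_size))
        self.b_in = np.zeros(hidden_size)
        self.w_mu = rng.normal(0.0, 0.1, (hidden_size, latent_size))
        self.b_mu = np.zeros(latent_size)
        self.w_logvar = rng.normal(0.0, 0.1, (hidden_size, latent_size))
        self.b_logvar = np.zeros(latent_size)

    def forward(self, x, y_onehot):
        h = np.concatenate([x, y_onehot], axis=1) @ self.w_in + self.b_in
        h = np.maximum(h, 0.0)                                     # ReLU
        h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + 1e-5)   # simplified batch norm
        mu = h @ self.w_mu + self.b_mu
        logvar = h @ self.w_logvar + self.b_logvar                 # log-variance keeps sigma^2 > 0
        return mu, logvar

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)
encoder = Encoder(window_size=8, n_ancestries=3, hidden_size=16, latent_size=4)
x = rng.choice([-1.0, 1.0], size=(5, 8))          # five encoded SNP windows
y = np.eye(3)[[0, 1, 2, 0, 1]]                    # one-hot ancestries
mu, logvar = encoder.forward(x, y)
z = reparameterize(mu, logvar, rng)
```

Because the randomness enters only through `eps`, the sample `z` is a deterministic, differentiable function of `mu` and `logvar`, which is what allows gradients to reach the encoder during training.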

The decoder, with a given ancestry $y$ and embedded representation $z$, tries to reconstruct the input SNPs $x$. In order to obtain training samples for LAI methods, new sequences can be simulated by selecting the desired ancestry $y$, sampling a random embedding, $z \sim \mathcal{N}(0, I)$, and reconstructing the SNP sequence $\tilde{x}$. The decoder networks start with an input layer of size $(J + A) \times H$ followed by a ReLU non-linearity, batch normalization and the output linear layer of size $H \times W$. The discriminator network is trained to distinguish the real samples $x$ from the fake samples $\tilde{x}$. The discriminator networks start with an input layer of size $(W + A) \times H$ followed by a ReLU non-linearity, batch normalization and the output linear layer of size $H \times 1$.
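The generation procedure (choose an ancestry, sample a random embedding, decode) can be sketched as follows. The weights here are random rather than trained, and the helper names, the layer sizes, and the final thresholding of the decoder output to -1/+1 are assumptions for illustration; the text does not state how continuous outputs are converted to SNP calls.

```python
import numpy as np

def init_decoder(latent_size, n_ancestries, hidden_size, window_size, seed=0):
    """Random (untrained) decoder weights, for illustration only."""
    rng = np.random.default_rng(seed)
    return {
        "w_in": rng.normal(0.0, 0.1, (latent_size + n_ancestries, hidden_size)),
        "b_in": np.zeros(hidden_size),
        "w_out": rng.normal(0.0, 0.1, (hidden_size, window_size)),
        "b_out": np.zeros(window_size),
    }

def decode(z, y_onehot, params):
    """Linear -> ReLU -> simplified batch norm -> linear output."""
    h = np.concatenate([z, y_onehot], axis=1) @ params["w_in"] + params["b_in"]
    h = np.maximum(h, 0.0)
    h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + 1e-5)
    return h @ params["w_out"] + params["b_out"]

def simulate_window(params, ancestry, n_samples, n_ancestries, latent_size, rng):
    """Sample z ~ N(0, I), decode with the chosen ancestry, and threshold to -1/+1."""
    z = rng.standard_normal((n_samples, latent_size))
    y = np.zeros((n_samples, n_ancestries))
    y[:, ancestry] = 1.0
    x_tilde = decode(z, y, params)
    return np.where(x_tilde >= 0, 1, -1)      # assumed conversion to SNP calls

rng = np.random.default_rng(0)
params = init_decoder(latent_size=4, n_ancestries=3, hidden_size=16, window_size=8)
generated = simulate_window(params, ancestry=2, n_samples=6,
                            n_ancestries=3, latent_size=4, rng=rng)
```

A full haploid sequence would be assembled by concatenating the outputs of the per-window decoders along the chromosome.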

The encoder is trained by minimizing the mean square error between the input and reconstructed sequences and the Kullback-Leibler divergence. The encoder loss function is as follows:


where $x$ and $\hat{x}$ are the input and reconstructed sequence respectively, $J$ is the dimension of the embedding space, $\mu_j$ is the $j$th element of $\mu$ and $\sigma_j^2$ is the $j$th element of the diagonal of $\Sigma$. The decoder is trained by minimizing the mean square error of the reconstruction and the adversarial loss:


where $\tilde{x}$ is a simulated sequence from a randomly selected ancestry $\tilde{y}$ and $z \sim \mathcal{N}(0, I)$. In our work we select . The discriminator is trained using binary cross-entropy with real, $x$, and simulated data, $\tilde{x}$:
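Written out, the three training objectives described in this section take roughly the following form. This is a sketch under assumptions rather than the authors' exact equations: the names $\mathcal{L}_{\mathrm{enc}}$, $\mathcal{L}_{\mathrm{dec}}$, $\mathcal{L}_{\mathrm{dis}}$, the decoder $G$, the discriminator $D$, the adversarial weight $\lambda$, the unit weighting between the reconstruction and Kullback-Leibler terms, and the minimax form of the adversarial term are all choices made here. The Kullback-Leibler term is the standard closed form for a diagonal Gaussian against $\mathcal{N}(0, I)$.

```latex
% Reconstruction error over a window of W SNPs
\mathrm{MSE}(x, \hat{x}) = \frac{1}{W} \sum_{i=1}^{W} (x_i - \hat{x}_i)^2

% Encoder: reconstruction error plus KL divergence to N(0, I)
\mathcal{L}_{\mathrm{enc}} = \mathrm{MSE}(x, \hat{x})
  + \frac{1}{2} \sum_{j=1}^{J} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)

% Decoder: reconstruction error plus weighted adversarial term,
% with \tilde{x} = G(z, \tilde{y}) and z \sim \mathcal{N}(0, I)
\mathcal{L}_{\mathrm{dec}} = \mathrm{MSE}(x, \hat{x})
  + \lambda \log \left( 1 - D(\tilde{x}, \tilde{y}) \right)

% Discriminator: binary cross-entropy on real and simulated samples
\mathcal{L}_{\mathrm{dis}} = - \log D(x, y) - \log \left( 1 - D(\tilde{x}, \tilde{y}) \right)
```

Each Kullback-Leibler summand is zero exactly when $\mu_j = 0$ and $\sigma_j^2 = 1$, so the term pulls the encoder's output distribution toward the $\mathcal{N}(0, I)$ prior from which embeddings are sampled at generation time.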


Because the sequence is generated in a windowed approach, a different ancestry can be assigned to each window to simulate an admixed individual. However, in this work we focus on single-ancestry individuals. The network is trained to obtain haploid sequences, but by generating pairs of haploid sequences, diploid chromosomes can be simulated. In order to avoid duplicate or very similar individuals, we generate $k$ times the number of desired individuals and compute the pair-wise correlations of the generated sequences. Then, we select the individuals with the lowest average correlation. In this paper we use .
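The over-generate-and-filter step can be sketched with NumPy's correlation matrix. The function name and the toy sequences are hypothetical, and Pearson correlation is assumed as the pair-wise measure; note that a constant sequence has undefined correlation and would need separate handling.

```python
import numpy as np

def select_diverse(sequences, n_keep):
    """Return indices of the n_keep rows with the lowest average pairwise correlation."""
    corr = np.corrcoef(sequences)            # (n, n) Pearson correlations between rows
    np.fill_diagonal(corr, np.nan)           # ignore each sequence's correlation with itself
    avg_corr = np.nanmean(corr, axis=1)
    keep = np.argsort(avg_corr)[:n_keep]
    return np.sort(keep)

# Toy candidates: rows 0 and 1 are duplicates; rows 2 and 3 are uncorrelated with all others.
candidates = np.array([
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
], dtype=float)

selected = select_diverse(candidates, n_keep=2)
diverse_sequences = candidates[selected]
```

In this toy case the duplicated rows have a higher average correlation than the other two, so they are the ones filtered out.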

4 Experimental Results

We use the single-ancestry out-of-Africa individuals of the training set to train each VAE-GAN. After training the networks, we generate a total of 80 synthetic samples per ancestry and train RFMix. RFMix is then evaluated with the admixed individuals in the validation set. We select the hyper-parameters of the VAE-GAN (window size, hidden layer size and embedding space dimension) and the training parameters (learning rate, batch size and number of epochs) that provide the highest validation accuracy of RFMix. Finally, we compare the testing accuracy of RFMix when trained with out-of-Africa data versus when trained with data generated with the VAE-GANs. Additionally, we compare the results of including the discriminator and the adversarial loss (VAE-GAN) with those of using only a VAE.

Table 1 shows that RFMix obtains comparable accuracies when trained with out-of-Africa data and with VAE-GAN simulated data (97.75% versus 97.72% testing accuracy). Adding the discriminator and the adversarial loss helps the network learn to simulate human-chromosome sequences that are more similar to the original training data and therefore more useful for training LAI methods: testing accuracy rises from 93.05% with the VAE alone to 97.72% with the VAE-GAN.

Method                     | RFMix Validation Accuracy | RFMix Testing Accuracy
Out-of-Africa Data         | 97.98%                    | 97.75%
Generated Data (VAE)       | 93.21%                    | 93.05%
Generated Data (VAE-GAN)   | 97.58%                    | 97.72%

Table 1: Accuracy of RFMix Maples et al. (2013) trained with real and generated data

5 Conclusions

In this work we show a proof of concept for data generation using VAE-GANs. Such networks show promising results with out-of-Africa simulated data. Strong simulation methods allow researchers to infer ancestry using a wide range of existing tools without needing access to real data from sensitive populations, or from proprietary or protected databases. Besides simulation, generative models have the potential to estimate meaningful representations in the embedding space or to be useful tools for data imputation or reconstruction.

Future work includes using real human-genome sequences to train and evaluate our networks and studying how generative models can be used to help interpret the histories of populations.


  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875. Cited by: §1.
  • A. Brock, J. Donahue, and K. Simonyan (2019) Large scale gan training for high fidelity natural image synthesis. Proceedings of the International Conference on Learning Representations. Note: New Orleans, LA Cited by: §1.
  • K. Bryc, E. Y. Durand, J. M. Macpherson, D. Reich, and J. L. Mountain (2015) The Genetic Ancestry of African Americans, Latinos, and European Americans across the United States. American journal of human genetics 96 (1), pp. 37–53. External Links: Document, Link Cited by: §1.
  • M. DeGiorgio, M. Jakobsson, and N. A. Rosenberg (2009) Out of Africa: modern human origins special feature: explaining worldwide patterns of human genetic variation using a coalescent-based serial founder model of migration outward from Africa.. Proceedings of the National Academy of Sciences of the United States of America 106 (38), pp. 16057–16062. Note: MS code for serial founder, discussion of effects, references for heterozygosity, LD, and slope of ancestral allele frequency spectrum. External Links: Document, Link Cited by: §1.
  • L. Duncan, H. Shen, B. Gelaye, J. Meijsen, K. Ressler, M. Feldman, R. Peterson, and B. Domingue (2019) Analysis of polygenic risk score usage and performance in diverse human populations. Nature Communications 10 (1), pp. 1–9 (English). External Links: Document Cited by: §1.
  • E. Y. Durand, C. B. Do, J. L. Mountain, and J. M. Macpherson (2014) Ancestry Composition: A Novel, Efficient Pipeline for Ancestry Deconvolution. bioRxiv, pp. 010512 (English). External Links: Document Cited by: Class-Conditional VAE-GAN for Local-Ancestry Simulation.
  • G. Eraslan, Ž. Avsec, J. Gagneur, and F. J. Theis (2019) Deep learning: new computational modelling techniques for genomics. Nature Reviews Genetics, pp. 389–403. Cited by: §1.
  • D. S. Gareau, J. Correa da Rosa, S. Yagerman, J. A. Carucci, N. Gulati, F. Hueto, J. L. DeFazio, M. Suárez-Fariñas, A. Marghoob, and J. G. Krueger (2017) Digital imaging biomarkers feed machine learning for melanoma screening. Experimental dermatology 26 (7), pp. 615–618. Cited by: §1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • S. Gravel, B. M. Henn, R. N. Gutenkunst, A. R. Indap, G. T. Marth, A. G. Clark, F. Yu, R. A. Gibbs, C. D. Bustamante, 1000 Genomes Project, et al. (2011) Demographic history and rare allele sharing among human populations. Proceedings of the National Academy of Sciences 108 (29), pp. 11983–11988. Cited by: §2.
  • E. Han, Y. Wang, P. Carbonetto, R. E. Curtis, J. M. Granka, J. Byrnes, K. Noto, A. R. Kermany, N. M. Myres, M. J. Barber, K. A. Rand, S. Song, T. Roman, E. Battat, E. Elyashiv, H. Guturu, E. L. Hong, K. G. Chahine, and C. A. Ball (2017) Clustering of 770,000 genomes reveals post-colonial population structure of North America. Nature Communications 8, pp. 14238. External Links: Document, ISSN 2041-1723 Cited by: §1.
  • J. Kelleher, A. M. Etheridge, and G. McVean (2016) Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Computational Biology 12 (5), pp. 1–22. Cited by: §2.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. Proceedings of the International Conference on Learning Representations, pp. 1–14. Cited by: §1.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep Learning. Nature 521, pp. 436–444. Cited by: §1.
  • J. Z. Li, D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto, S. Ramachandran, H. M. Cann, G. S. Barsh, M. Feldman, L. L. Cavalli-Sforza, and R. M. Myers (2008) Worldwide human relationships inferred from genome-wide patterns of variation. Science 319 (5866), pp. 1100–1104 (English). Note: Ancestral allele frequencies in various populations External Links: Document, Link Cited by: §1.
  • X. Li, L. Liu, J. Zhou, and C. Wang (2018) Heterogeneity analysis and diagnosis of complex diseases based on deep learning method. Scientific reports 8 (1), pp. 1–8. Cited by: §1.
  • A. S. Lundervold and A. Lundervold (2019) An overview of deep learning in medical imaging focusing on mri. Zeitschrift für Medizinische Physik 29 (2), pp. 102–127. Cited by: §1.
  • B. K. Maples, S. Gravel, E. E. Kenny, and C. D. Bustamante (2013) RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. The American Journal of Human Genetics 93 (2), pp. 278–288. Cited by: Class-Conditional VAE-GAN for Local-Ancestry Simulation, §1, Table 1.
  • A. R. Martin, M. Lin, J. M. Granka, J. W. Myrick, X. Liu, A. Sockell, E. G. Atkinson, C. J. Werely, M. Möller, M. S. Sandhu, et al. (2017) An unexpectedly complex architecture for skin pigmentation in africans. Cell 171 (6), pp. 1340–1353. Cited by: §1.
  • A. B. Popejoy and S. M. Fullerton (2016) Genomics is failing on diversity. Nature News 538 (7624), pp. 161–164. External Links: Document, Link Cited by: §1.
  • A. L. Price, A. Tandon, N. Patterson, K. C. Barnes, N. Rafaels, I. Ruczinski, T. H. Beaty, R. Mathias, D. Reich, and S. Myers (2009) Sensitive Detection of Chromosomal Segments of Distinct Ancestry in Admixed Populations. PLoS Genetics 5 (6), pp. 1–18. External Links: Document, Link Cited by: §1.
  • S. Sankararaman, S. Sridhar, G. Kimmel, and E. Halperin (2008) Estimating local ancestry in admixed populations. The American Journal of Human Genetics 82 (2), pp. 290–303. Cited by: §1.
  • K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, pp. 3483–3491. Cited by: §1.
  • A. Sundquist, E. Fratkin, C. B. Do, and S. Batzoglou (2008) Effect of genetic divergence in identifying ancestral origin using HAPAA. Genome research 18, pp. 676–682. Cited by: §1.
  • H. Tang, M. Coram, P. Wang, X. Zhu, and N. Risch (2006) Reconstructing genetic ancestry blocks in admixed individuals. The American Journal of Human Genetics 79, pp. 1–12. Cited by: §1.
  • G. L. Wojcik, M. Graff, K. K. Nishimura, R. Tao, J. Haessler, C. R. Gignoux, H. M. Highland, Y. M. Patel, E. P. Sorokin, C. L. Avery, G. M. Belbin, S. A. Bien, I. Cheng, S. Cullina, C. J. Hodonsky, Y. Hu, L. M. Huckins, J. Jeff, A. E. Justice, J. M. Kocarnik, U. Lim, B. M. Lin, Y. Lu, S. C. Nelson, S. L. Park, H. Poisner, M. H. Preuss, M. A. Richard, C. Schurmann, V. W. Setiawan, A. Sockell, K. Vahi, M. Verbanck, A. Vishnu, R. W. Walker, K. L. Young, N. Zubair, V. Acuna-Alonso, J. L. Ambite, K. C. Barnes, E. Boerwinkle, E. P. Bottinger, C. D. Bustamante, C. Caberto, S. Canizales-Quinteros, M. P. Conomos, E. Deelman, R. Do, K. Doheny, L. Fernández-Rhodes, M. Fornage, B. Hailu, G. Heiss, B. M. Henn, L. A. Hindorff, R. D. Jackson, C. A. Laurie, C. C. Laurie, Y. Li, D. Lin, A. Moreno-Estrada, G. Nadkarni, P. J. Norman, L. C. Pooler, A. P. Reiner, J. Romm, C. Sabatti, K. Sandoval, X. Sheng, E. A. Stahl, D. O. Stram, T. A. Thornton, C. L. Wassel, L. R. Wilkens, C. A. Winkler, S. Yoneyama, S. Buyske, C. A. Haiman, C. Kooperberg, L. Le Marchand, R. J. F. Loos, T. C. Matise, K. E. North, U. Peters, E. E. Kenny, and C. S. Carlson (2019) Genetic analyses of diverse populations improves discovery for complex traits. Nature 570 (7762), pp. 514–518 (English). External Links: Document Cited by: §1.