A Graph Auto-Encoder for Haplotype Assembly and Viral Quasispecies Reconstruction

by   Ziqi Ke, et al.
The University of Texas at Austin

Reconstructing components of a genomic mixture from data obtained by means of DNA sequencing is a challenging problem encountered in a variety of applications including single individual haplotyping and studies of viral communities. High-throughput DNA sequencing platforms oversample mixture components to provide massive amounts of reads whose relative positions can be determined by mapping the reads to a known reference genome; assembly of the components, however, requires discovery of the reads' origin – an NP-hard problem that the existing methods struggle to solve with the required level of accuracy. In this paper, we present a learning framework based on a graph auto-encoder designed to exploit structural properties of sequencing data. The algorithm is a neural network which essentially trains to ignore sequencing errors and infers the posteriori probabilities of the origin of sequencing reads. Mixture components are then reconstructed by finding consensus of the reads determined to originate from the same genomic component. Results on realistic synthetic as well as experimental data demonstrate that the proposed framework reliably assembles haplotypes and reconstructs viral communities, often significantly outperforming state-of-the-art techniques.


page 1

page 2

page 3

page 4


Real-Time Radio Technology and Modulation Classification via an LSTM Auto-Encoder

Identification of the type of communication technology and/or modulation...

Scalable De Novo Genome Assembly Using Pregel

De novo genome assembly is the process of stitching short DNA sequences ...

Learning to Untangle Genome Assembly with Graph Convolutional Networks

A quest to determine the complete sequence of a human DNA from telomere ...

PANDA: Processing-in-MRAM Accelerated De Bruijn Graph based DNA Assembly

Spurred by widening gap between data processing speed and data communica...

Matrix Completion and Performance Guarantees for Single Individual Haplotyping

Single individual haplotyping is an NP-hard problem that emerges when at...

ComHapDet: A Spatial Community Detection Algorithm for Haplotype Assembly

Background: Haplotypes, the ordered lists of single nucleotide variation...

1 Introduction

Genetic makeup of a biological sample, inferred by means of DNA sequencing, will help determine an individual’s susceptibility to a broad range of chronic and acute diseases, support the discovery of new pharmaceutical products, and personalize and improve the delivery of health care. However, before the promises of personalized medicine come to fruition, efficient methods for accurate inference of genetic variations from massive DNA sequencing data must be devised.

Information about variations in an individual genome is provided by haplotypes, ordered lists of single nucleotide polymorphisms (SNPs) on the individual’s chromosomes (schw2010). High-throughput DNA sequencing technologies generate massive amounts of reads that sample an individual genome and thus enable studies of genetic variations (schw2010,clark2004role,sabe02). Haplotype reconstruction, however, remains challenging due to limited lengths of reads and presence of sequencing errors (hash18). Particularly difficult is the assembly of haplotypes in polyploids, organisms with chromosomes organized in -tuples with , where deep coverage is typically required to achieve desired accuracy. This implies high cost and often renders existing haplotype assembly techniques practically infeasible (motazedi2018exploit).

A closely related problem to haplotype assembly is that of reconstructing viral communities. RNA viruses such as hepatitis, HIV, and Ebola, are characterized by high mutation rates which give rise to communities of viral genomes, the so-called viral quasispecies. Determining genetic diversity of a virus is essential for the understanding of its origin and mutation patterns, and the development of effective drug treatments. Reconstructing viral quasispecies (i.e., viral haplotypes, as we refer to them for convenience) is even more challenging than haplotype assembly (ahn2018viral) since the number of constituent strains in a community is typically unknown, and its spectra (i.e., strain frequencies) non-uniform.

11footnotetext: This work was funded in part by the NSF grant CCF 1618427.

Existing methods often approach haplotype assembly as the task of grouping sequencing reads according to their chromosomal origin into as many clusters as there are chromosomes. Separation of reads into clusters is rendered challenging by their limited lengths and the presence of sequencing errors (hash18); such artifacts create ambiguities regarding the origin of the reads. The vast majority of existing haplotype assembly methods attempt to remove the aforementioned ambiguity by altering or even discarding the data, leading to minimum SNP removal (lanc01), maximum fragments cut (duit10), and minimum error correction (MEC) score (lipp02) optimization criteria. Majority of haplotype assembly methods developed in recent years are focused on optimizing the MEC score, i.e., determining the smallest possible number of nucleotides in sequencing reads that should be altered such that the resulting dataset is consistent with having originated from haplotypes (

denotes the ploidy of an organism) (xie16,piro15,kule14,patt15,boni16). These include the branch-and-bound scheme (wang05), an integer linear programming formulation in (chen13), and a dynamic programming framework in (kule14). All these techniques attempt to find exact solution to the MEC score minimization problem; the resulting high complexity has motivated search for computationally efficient heuristics. They include the greedy algorithm in (levy07) and methods that compute posterior joint probability of the alleles in a haplotype sequence via MCMC (bans08) and Gibbs (kim07) sampling. A max-cut algorithm for haplotype assembly in (bans08) is motivated by the clustering interpretation of the problem. The efficient algorithm proposed there, HapCUT, has recently been upgraded as HapCUT2 (edge17). In (agui12), a novel flow-graph approach to haplotype assembly was proposed, demonstrating performance superior to state-of-the-art methods. More recent methods include a greedy max-cut approach in (duit11), convex optimization framework in (das15), and a communication-theoretic motivated algorithm in (pulj16).

Haplotype assembly for polyploids () is more challenging than that for diploids () due to a much larger space of possible solutions to be searched. Among the aforementioned methods, only HapCompass (agui12), SDhaP (das15) and BP (pulj16) are capable of solving the haplotype assembly problem for . Other techniques that can handle reconstruction of haplotypes for both diploid and polyploid genomes include a Bayesian method HapTree (berg14), a dynamic programming method H-PoP (xie16) shown to be more accurate than the techniques in (agui12,berg14,das15), and the matrix factorization schemes in (cai16,hash18).

On another note, a number of viral quasispecies reconstruction methods were proposed in recent years. Examples include ShoRAH (zagordi2011shorah) and ViSpA (astrovskaya2011inferring) that perform read clustering and read-graph path search, respectively, to identify distinct viral components. QuasiRecomb (topfer2013probabilistic) casts the problem as the decoding in a hidden Markov model while QuRe (prosperi2012qure) formulates it as a combinatorial optimization. PredictHaplo (prabhakaran2014hiv) employs non-parametric Bayesian techniques to automatically discover the number of viral strains in a quasispecies. More recently, aBayesQR (ahn2017abayesqr) approached viral quasispecies reconstruction with a combination of hierarchical clustering and Bayesian inference while (ahn2018viral) relies on tensor factorization.

In this paper, we propose a first ever neural network-based learning framework, named GAEseq, to both haplotype assembly and viral quasispecies reconstruction problems. The framework aims to estimate the posterior probabilities of the origins of sequencing reads using an auto-encoder whose design incorporates salient characteristics of the sequencing data. Auto-encoders (Fukushima1975) are neural networks that in an unsupervised manner learn a low-dimensional representation of data; more specifically, they attempt to perform a dimensionality reduction while robustly capturing essential content of high-dimensional data (goodfellow2016deep). Auto-encoders have shown outstanding performance in a variety of applications across different fields including natural language processing (richard2011), collaborative filtering (rianne2017graph), and information retrieval (thomas2016variational), to name a few. Typically, auto-encoders consist of two blocks: an encoder and a decoder. The encoder converts input data into the so-called codes while the decoder reconstructs the input from the codes. The act of copying the input data to the output would be of little interest without an important additional constraint – namely, the constraint that the dimension of codes is smaller than the dimension of the input. This enables auto-encoders to extract salient features of the input data. For both the single individual and viral haplotype reconstruction problems, the salient features of data are the origins of sequencing reads. In our work, we propose a graph auto-encoder architecture with an encoder featuring a softmax function placed after the dense layer that follows graph convolutional layers (Masci2011; rianne2017graph); the softmax function acts as an estimator of the posterior probabilities of the origins of sequencing reads. The decoder assembles haplotypes by finding the consensus sequence for each component of the mixture, thus enabling end-to-end solution to the reconstruction problems.

2 Methods

2.1 Problem formulation

Let denote a haplotype matrix where is the number of (single individual or viral) haplotypes and is the haplotype length. Furthermore, let denote an SNP fragment matrix whose rows correspond to sequencing reads and columns correspond to SNP positions. Matrix is formed by first aligning reads to a reference genome and then identifying and retaining only the information that the reads provide about heterozygous genomic sites. One can interpret as being obtained by sparsely sampling an underlying ground truth matrix , where the row of is the haplotype sampled by the read. The sampling is sparse because the reads are much shorter than the haplotypes; moreover, the reads may be erroneous due to sequencing errors. Following (ahn2018viral) , we formalize the sampling operation as


where denotes the set of informative entries in , i.e., the set of such that the SNP is covered by the read, and is the projection operator denoting the sampling of haplotypes by reads. Sequencing is erroneous and thus may differ from ; in particular, given sequencing error rate , with probability .

Since each read samples one of the haplotypes, where denotes the matrix indicating origins of the reads in . In particular, each row of matrix is one of the

-dimensional standard unit vectors

, with in the position and the remaining entries . If read samples haplotype, the row of is . If the origins of reads were known, each haplotype could be reconstructed by finding consensus of reads which sample that particular haplotype. We think of the assembly as a two-step procedure: given the SNP fragment matrix we first identify the read origin indicator matrix and then use to reconstruct the haplotype matrix .

To characterize the performance of haplotype assembly methods we rely on two metrics: the minimum error correction (MEC) score, which can be traced back to (lippert2002algorithmic), and the correct phasing rate, also referred to as reconstruction rate. The MEC score is defined as the smallest number of observed entries in that need to be altered (i.e., corrected) such that the resulting data is consistent with having originated from distinct haplotypes, i.e.,


where HD denotes the Hamming distance between its arguments (sequences, evaluated only over informative entries), denotes the row of and denotes the row of . The correct phasing rate (CPR) is defined as


where is the one-to-one mapping from the set of reconstructed haplotype to the set of true haplotype (hash18), i.e., mapping that determines the best possible match between the two sets of haplotypes. To characterize performance of methods for reconstruction of viral quasispecies with generally a priori unknown number of components, in addition to correct phasing rate we also quantify recall rate, defined as the fraction of perfectly reconstructed components in a population (i.e., recall rate = ), and predicted proportion, defined as the ratio of the estimated and the true number of components in a genomic mixture (ahn2018viral).

To assemble haplotypes from a set of reads we design and employ a graph auto-encoder. Fig. 1 (b) shows the entire end-to-end pipeline that takes the collection of erroneous reads and generates reconstructed haplotypes. First, the SNP fragment matrix is processed by the graph encoder to infer the read origin indicator matrix ; then, a haplotype decoder reconstructs matrix . The graph auto-encoder is formalized in the next section.

Figure 1: (a) Segment of the SNP fragment matrix. Non-zero entries represent SNP information provided by sequencing reads; labels - indicate the four nucleotides. Zero entries in a row indicate that the read does not cover corresponding SNP. In this illustration, the first two rows represent reads originating from the same haplotype; the third and fourth reads both originated from another haplotype; and so on. (b) The pipeline from the SNP fragment matrix to haplotypes via a graph auto-encoder.

2.2 Graph auto-encoders

Graph auto-encoders are a family of auto-encoders specifically designed for learning on graph-structured data (rianne2017graph; thomas2016variational). In this paper, we design graph auto-encoders for the assembly of the components of a genomic mixture. As in conventional auto-encoder structures, the developed architecture consists of two parts: the graph encoder and the decoder. The graph encoder takes the SNP fragment matrix and the graph adjacency matrix as inputs, and outputs the node embedding matrix

. Note that we impose constraints on the node embedding matrix so that the salient features extracted by a graph auto-encoder approximate the read origin indicator matrix

. Such a constraint does not prevent efficient training of the auto-encoders via backpropagation. The decoder

is utilized to reconstruct the SNP fragment matrix and the haplotype matrix from the node embedding matrix

; this implies that the decoder is essentially capable of imputing the unobserved entries in the SNP fragment matrix.

To numerically represent information in the SNP fragment matrix , we encode its entries using a set of discrete values – one for each of possible nucleotides – where the mapping between nucleotides and the discrete values can be decided arbitrarily. To this end, we may simply represent the nucleotides A, C, G and T by , , and , respectively; non-informative entries in each row of , i.e., SNP positions not covered by a read, are represented by . Note that the SNP fragment matrix can be represented by an undirected bipartite graph where the set of read nodes with and the set of SNP nodes with together form the set of vertices , i.e., . The weights assigned to edges are the discrete values used to represent nucleotides. With this model in place, we can rephrase the graph encoder as , where represents the graph adjacency matrix for a nucleotide encoded by . Equivalently, has 1’s for the entries whose corresponding positions in are encoded by . Since we are interested in imputing the unobserved entries based on the observed entries in instead of simply copying the observed entries to , it is beneficial to reformulate the decoder as . In other words, the auto-encoder is trained to learn from the observed entries in order to determine origin of reads, impute unobserved entries of , and reconstruct haplotypes in the genomic mixture.

2.3 Read origin detection via graph encoder

Recall the interpretation that the SNP fragment matrix is obtained by erroneously sampling an underlying ground truth matrix . This motivates development of a specific graph encoder architecture, motivated by the ideas of the design in (rianne2017graph), that is capable of detecting origin of sequencing reads in via estimating the posterior probabilities of the origin of each read.

Let denote an diagonal read degree matrix whose entries indicate the number of SNPs covered by each read, and let denote an diagonal SNP degree matrix whose entries indicate the number of reads covering each SNP. We facilitate exchange of messages between read nodes and SNP nodes in the graph, initiating it from the set of read nodes ; doing so helps reduce the dimensions of weights and biases since the number of reads is far greater than the haplotype length . Note that the dimension of messages keeps reducing during the message passing procedure.

Figure 2: A forward pass through the graph auto-encoder consisting of a stacked graph encoder that passes messages between read and SNP nodes and constructs approximate read origin indicator matrix via the softmax function. Decoder reconstructs haplotypes and SNP fragment matrix.

The messages from read nodes to SNP nodes are


where and denote the weights and biases of the first convolutional layer for the nucleotide encoded with , respectively,

denotes an element-wise activation function such as

, and denotes the transpose of a matrix. The dimension of both and is , where denotes the message length after the first message passing step.

The messages from SNP nodes to read nodes are


where and denote the weights and biases of the second convolutional layer for the nucleotide encoded with , respectively. The dimension of both and is , where denotes the message length after the second message passing step.

Repeating message passing and stacking the convolutional layers leads to formation of a deep model. The read nodes to SNP nodes layer is readily generalized as


where and for . The dimension of is . Furthermore, the SNP nodes to read nodes layer is generalized as


where . The dimension of is . Note that the messages are passed from read nodes to SNP nodes when the subscript of

is odd, and otherwise traverse in the opposite direction.

Equation (6) and (7) specify the graph convolutional layer while the dense layer is defined as


where denotes the output of the dense layer, and are the weights and biases of the dense layer, respectively, is the output of the last graph convolutional layer, and represents the number of graph convolutional layers. The dimension of is and the dimension of and is , where denotes the ploidy (i.e., the number of components in a genomic mixture).

To find which approximates the read origin indicator matrix (i.e., with each row close in the -norm sense to a -dimensional standard basis vector), we employ the softmax function


where in our experiments we set to 200. Having estimated read origins by the node embedding matrix , the reads can be organized into clusters. This enables straightforward reconstruction of haplotypes by determining the consensus sequence for each cluster.

2.4 Haplotype decoder

Thus far, we have conveniently been representing alleles as the numbers in

. It is desirable, however, that in the definition of a loss function the distance between numerical values representing any two alleles is identical, no matter which pair of alleles is considered; this ensures the loss function relates to the MEC score – the metric of interest in haplotype assembly problems. Following (ahn2018viral), we define the loss function of the auto-encoder as the squared Frobenius norm of the difference between a one-hot SNP fragment matrix

and the reconstructed matrix at the informative positions, i.e., , where and are formed by substituting discrete values by the set of four dimensional standard basis vectors . With such a notational convention, the proposed loss function approximates the MEC score; it only approximates the score, rather than coincides with it, because is an approximation of the read-origin matrix . Therefore, the graph auto-encoder is trained to approximately minimize the MEC score. Fig. 2 illustrates the data processing pipeline that takes as inputs reads in the SNP fragment matrix and produces the matrix of haplotypes as well as imputes missing entries in the SNP fragment matrix. The proposed graph auto-encoders for haplotype assembly and viral quasispecies reconstruction are formalized as Algorithm 1 and Algorithm 2, respectively. For the viral quasispecies reconstruction problem, the number of clusters is typically unknown; detailed strategy based on (ahn2018viral) for the automated inference of can be found in Supplementary Document B.

1:Input: SNP fragment matrix , the number of experiments and the number of haplotpyes
2:Output: Reconstructed haplotypes
3:while  do
4:     Initialize , , and using Xavier initialization where and
5:     for  to  do
9:           with
10:          Calculate by majority voting
12:          Record reconstructed haplotypes and the MEC score
13:          Update , , and using Adam Optimizer where and
14:     end for
16:end while
17:Output the reconstructed haplotypes corresponding to the lowest MEC score
Algorithm 1 Graph auto-encoder for haplotype assembly
1:Input: SNP fragment matrix , the number of experiments , the MEC improvement rate threshold and the estimated initial number of components
2:Output: Reconstructed viral haplotypes and the inferred frequencies
3:Initial , MECflag and
4:while  or  do
5:     for  do
6:         Run Algorithm 1 with
7:     end for
8:     if MECimpr then
9:         ; MECflag
10:     else
11:         if MECflag  then
13:         else
15:         end if
16:     end if
18:end while
19:Output the viral quasispecies with and the inferred frequencies
Algorithm 2 Graph auto-encoder for viral quasispecies reconstruction

3 Results

The hyper-parameters of GAEseq are determined by training on 5 synthetic triploid datasets with coverage 30 and validated on different 5 synthetic triploid datasets with the same coverage. The results reported in this section are obtained on test data. Detailed description of the computational platform and the choice of hyper-parameters can be found in Supplementary Document A.

Coverage Mean SD Mean SD
15 GAEseq 8.200 4.686 0.822 0.048
HapCompass 100.700 66.150 0.763 0.046
H-PoP 28.700 32.667 0.783 0.066
AltHap 59.100 28.125 0.709 0.054
25 GAEseq 8.400 4.719 0.831 0.081
HapCompass 124.800 132.156 0.810 0.063
H-PoP 33.800 47.434 0.798 0.046
AltHap 92.600 83.649 0.756 0.068
35 GAEseq 10.700 3.234 0.857 0.087
HapCompass 217.400 174.135 0.775 0.072
H-PoP 41.700 53.971 0.823 0.094
AltHap 164.000 101.583 0.754 0.093
Table 1: Performance comparison on biallelic Solanum Tuberosum semi-experimental data.
Figure 3: The precision-recall curves for Solanum Tuberosum semi-experimental data with coverage 15, 25 and 35
p17 p24 p2-p6 PR RT RNase int vif vpr vpu gp120 gp41 nef
GAEseq PredProp 1 1 1 1 1.2 1 1 1 1 1.2 1 1 1
CPR 100 99.4 100 100 100 100 100 100 100 100 96.2 96.7 100
CPR 100 99.4 100 100 100 100 100 100 100 99.2 99.4 100 98.2
CPR 100 100 100 100 100 100 100 100 100 100 99.9 100 99.3
CPR 100 100 100 100 100 100 100 100 100 100 100 100 99.8
CPR 100 100 100 100 100 100 100 100 100 100 99.6 100 98.1
PredictHap PredProp 1 0.6 1 1 1 0.8 0.8 0.8 1 0.8 0.8 0.8 0.8
CPR 100 0 100 100 100 98.9 100 100 100 93.2 0 0 0
CPR 100 100 100 100 100 100 99.8 100 100 0 97.8 100 98.8
CPR 100 100 100 100 100 100 100 100 100 100 99.7 100 100
CPR 100 99.1 100 100 100 100 100 100 100 100 100 100 100
CPR 100 0 100 100 100 0 0 0 100 100 98.6 100 100
TenSQR PredProp 1 1.6 1 1 1.4 1 1 1 1 1.6 2.2 1.2 0.8
CPR 100 98.9 100 100 99.2 100 100 100 100 92.8 96.0 99.0 0
CPR 100 100 100 100 98.0 100 100 100 100 94.0 97.2 100 95.7
CPR 100 100 100 100 100 100 100 100 100 100 98.3 97.7 99.8
CPR 100 99.3 100 100 99.5 100 100 100 100 100 99.8 99.5 99.7
CPR 100 99.3 100 99.7 99.7 100 100 100 100 100 94.9 100 98.6
aBayesQR PredProp 1 1 1 1 1 1 1 1 1.2 1 0.8 0.8 1.2
CPR 100 99.4 100 100 98.5 100 99.9 100 100 99.6 98 0 95.8
CPR 100 98.7 100 100 98.6 100 100 100 100 92 96.5 98.9 95.5
CPR 100 99.6 100 100 99 100 100 100 100 98.8 97.7 99.1 98.2
CPR 100 100 100 100 98.9 100 100 99.8 100 100 96.3 98.8 100
CPR 100 99.7 100 100 99.2 100 99.5 99.7 100 100 0 98.6 99.2
Table 2: Performance comparison of GAEseq, PredictHap, TenSQR and aBayesQR on a real HIV-1 5-virus-mix data. Genes where all the strains are perfectly reconstructed are denoted as boldface.

3.1 Performance comparison on biallelic Solanum Tuberosum semi-experimental data

We first evaluate performance of GAEseq on realistic simulations which, for convenience and to distinguish from perhaps more rich synthetic and experimental data discussed in supplementary documents, we refer to as ”semi-experimental data”. The semi-experimental data is obtained by simulating mutations, shotgun sequencing procedure, read alignment and SNP calling steps in a fictitious experiment on a single individual Solanum Tuberosum (polyploid with

). Details on how exactly the semi-experimental data is generated and processed can be found in Supplementary Document C. We compare the performance of GAEseq on this data with publicly available software HapCompass (agui12), an algorithm that relies on graph-theoretic models to perform haplotype assembly, H-PoP (xie16), a dynamic programming method, and AltHap (hash18), a method based on tensor factorization. The performance of different methods is evaluated in terms of the MEC score and CPR. All the considered softwares were executed with their default settings, i.e. we follow instructions in the papers they were originally proposed; there are no parameter tuning steps required for these methods. We report the MEC scores and CPR achieved by the considered algorithms in Table 1. For each sequencing coverage, the mean and standard deviation (SD) of the adopted metrics are evaluated over 10 samples. As shown in the table, GAEseq achieves the lowest average MEC score as well as the lowest standard deviation of the MEC score at all sequencing coverage settings. Moreover, GAEseq achieves the highest average CPR at all coverage settings. Note that the MEC score increases with sequencing coverage since higher coverage implies more reads. The results demonstrate that the adopted graph abstraction enables GAEseq to achieve high accuracy of the reconstruction task by learning posterior probabilities of the origins of reads. Fig. 3 shows the precision-recall curves for data with coverage 15

, 25 and 35. Note that GAEseq performs very accuratly at high sequencing coverage while its performance deteriorates at low coverage. An extended version of Table 1 with additional coverage settings is in Supplementary Document C.

We further test the performance of GAEseq on simulated biallelic diploid, polyallelic triploid and tetraploid data, and on real Solanum Tuberosum data; in addition to H-Pop, AltHap and HapCompass, comparisons on diploid data also include performance of HapCUT2 (edge17). GAEseq outperforms all the considered algorithms by achieving lower MEC score and higher CPR. Further details can be found in Supplementary Document D and E.

3.2 Performance comparison on gene-wise reconstruction of real HIV-1 data

The real HIV-1 data with pairwise distances between and relative frequencies between and is an in vitro viral population of 5 known HIV-1 strains generated by Illumina’s MiSeq Benchtop Sequencer (di2014full). These reads are then aligned to the HIV- reference genome. According to (di2014full), we remove reads of length lower than 150bp and mapping quality scores lower than 60 for better results. We compare the performance of GAEseq on gene-wise reconstruction of the HIV population to that of other state-of-the-art methods such as PredictHaplo (prabhakaran2014hiv), TenSQR (ahn2018viral) and aBayesQR (ahn2017abayesqr), following their default settings. For fair benchmarking, we use the same dataset as (ahn2018viral) which is why the results of our benchmarking tests match those in (ahn2018viral). The correct phasing rate and the inferred strain frequencies are evaluated for all reconstructed strains because the ground truth for the 5 HIV-1 strains is available at (https://bmda.dmi.unibas.ch/software.html). Following (ahn2018viral), we evaluate predicted proportion by setting the parameter needed to detect the number of HIV-1 strains to . The results in Table 2 show that GAEseq perfectly reconstructs all 5 HIV-1 strains in 8 genes while other methods correctly reconstruct components in 5 or 6 genes. This demonstrates that GAEseq’s inference of read origins based on posterior probabilities enables high accuracy of the reconstruction tasks. Regarding the 5 genes where GAEseq and other methods do not achieve perfect reconstruction (p24, vpu, gp120, gp41, nef): closer examination of viral strains reconstructed by various methods suggests translocations of short viral segments within those 5 genes in the “gold standard” dataset created by (di2014full). Those short translocation cause mismatch between the actual ground truth and the sequences (di2014full) generated. Further results on reconstruction of HIV viral communities can be found in Supplement Document F.

4 Conclusions

In this article, we introduce auto-encoders to the problem of reconstructing components of a genomic mixture from high-throughput sequencing data that is encountered in haplotype assembly and analysis of viral communities. In particular, a graph auto-encoder is trained to group together reads that originate from the same component of a genomic mixture and impute missing information in the SNP fragment matrix by learning from the available data. The graph convolutional encoder attempts to discover origin of the reads while the decoder aims to reconstruct haplotypes and impute missing information, effectively correcting sequencing errors. Studies on semi-experimental data show that GAEseq can achieve significantly lower MEC scores and higher CPR than the competing methods. Benchmarking tests on simulated and experimental data demonstrate that GAEseq maintains good performance even at low sequencing coverage. Studies on real HIV-1 data illustrate that GAEseq outperforms existing state-of-the-art methods in viral quasispecies reconstruction.


  • [Schwartz 2010] Schwartz, R. 2010. Theory and algorithms for the haplotype assembly problem. Communications in Info. & Sys. vol. 10, no. 1, 2010, 23–38.
  • [Clark 2004] Clark, A. G. 2004. The role of haplotypes in candidate gene studies. Genet Epidemiol. vol. 27, no. 4, 2004, 321–33.
  • [Sabeti 2002] Sabeti, P.; Reich, D.; Higgins, J.; Levine, H.; and Richter, D. 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature, 419(6909):832–37, 2002.
  • [Lancia 2001] Lancia, G.; Bafna, V.; Istrail, S.; Lippert, R.; and Schwartz, R. 2001. SNPs problems, complexity, and algorithms. European symposium on algorithms, vol. 1, Springer: 2001.182–93.
  • [Duitama 2010] Duitama, J.; Huebsch, T.; McEwen, G.; Suk, E.; and Hoehe, MR. 2010. Refhap: a reliable and fast algorithm for single individual haplotyping. Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology, ACM: 2010. 160–169.
  • [Lippert 2002] Lippert, R.; Schwartz, R.; Lancia, G.; and Istrail, S. 2002 Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem. Brief Bioinformatics, 2002, 3(1):23–31.
  • [Xie 2016] Xie, M.; Wu, Q.; Wang, J.; and Jiang, T. 2016. H-PoP and H-PoPG: Heuristic partitioning algorithms for single individual haplotyping of polyploids. Bioinformatics, 2016, 32(24):3735–44.
  • [Pirola 2015] Pirola, Y.; Zaccaria, S.; Dondi, R.; Klau, G.; Pisanti, N.; and Bonizzoni, P. 2015. Hapcol: accurate and memory-efficient haplotype assembly from long reads. Bioinformatics, 2015, 32(11), 1610–1617.
  • [Kuleshov 2014] Kuleshov, V. 2014. Probabilistic single-individual haplotyping. Bioinformatics, 2014, 30(17):379–85.
  • [Patterson 2015] Patterson, M.; Marschall, T.; Pisanti, N.; Van, Iersel L.; and Stougie, L. 2015. Whatshap: weighted haplotype assembly for future-generation sequencing reads. J Comput Biol., 2015, 22(6):498–509.
  • [Bonizzoni 2016] Bonizzoni, P.; Dondi, R.; Klau, G.; Pirola, Y.; Pisanti, N.; and Zaccaria, S. 2016. On the minimum error correction problem for haplotype assembly in diploid and polyploid genomes. J Comput Biol., 2016, 23(9):718–36.
  • [Edge 2017] Edge, P.; Bafna, V.; and Bansal, V. 2017. Hapcut2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res., 2017; 27(5):801–12.
  • [Aguiar 2012] Aguiar, D., and Istrail, S. 2012. Hapcompass: a fast cycle basis algorithm for accurate haplotype assembly of sequence data. J Comput Biol. 2012; 19(6):577–90.
  • [Duitama 2011] Duitama, J.; McEwen, G.; Huebsch, T.; Palczewski, S.; and Schulz, S. 2011. Fosmid-based whole genome haplotyping of a hapmap trio child: evaluation of single individual haplotyping techniques. Nucleic Acids Res., 2011, 40(5), 2041–2053.
  • [Das 2015] Das, S.; and Vikalo, H. 2015. SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming. BMC Genomics, 2015; 16(1):260.
  • [Puljiz 2016] Puljiz, Z.; and Vikalo, H. 2016. Decoding genetic variations: communications-inspired haplotype assembly. IEEE/ACM Trans Comput Biol Bioinform (TCBB), 2016; 13(3):518–30.
  • [Berger 2014] Berger, E.; Yorukoglu, D.; Peng, J.; and Berger, B. 2014. Haptree: A novel Bayesian framework for single individual polyplotyping using NGS data. PLoS Comput Biol. 2014; 10(3):e1003502.
  • [Ahn 2018] Ahn, S.; Ke, Z.; and Vikalo, H. (2018). Viral quasispecies reconstruction via tensor factorization with successive read removal. Bioinformatics (Oxford, England), 34(13), i23–i31.
  • [Wang 2005] Wang, R.; Wu, L.; Li, Z.; and Zhang, X. 2005. Haplotype reconstruction from SNP fragments by minimum error correction. Bioinformatics, 2005, 21(10):2456–2462.
  • [Chen 2013] Chen, Z.; Deng, F.; and Wang, L. 2013. Exact algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics, 2013, 29(16):1938–1945.
  • [Levy 2007] Levy, S.; Sutton, G.; Ng, P.; Feuk, L.; and Halpern, A. 2007 The diploid genome sequence of an individual human. PLoS Biol., 2007, 5(10):254.
  • [Bansal 2008] Bansal, V.; Halpern, A.; Axelrod, N.; and Bafna, V. 2008. An MCMC algorithm for haplotype assembly from whole-genome sequence data. Genome Res., 2008; 18(8):1336–46.
  • [Kim 2007] Kim, J.; Waterman M.; and Li, L. 2007. Diploid genome reconstruction of ciona intestinalis and comparative analysis with ciona savignyi. Genome Res. 2007; 17(7):1101–10.
  • [Goodfellow 2016] Goodfellow I.; Bengio, Y.; and Courville A. 2016. Deep Learning. MIT Press.
  • [Lippert 2002] Lippert, R.; Schwartz, R.; Lancia, G.; and Istrail, S. 2002. Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem. Briefings in bioinformatics, 3(1), 23–31.
  • [Motazedi 2018] Motazedi, E.; Finkers, R.; Maliepaard, C.; and de Ridder, D. 2018. Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study. Briefings in bioinformatics, 19(3), 387–403.
  • [Cai 2016] Cai, C.; Sanghavi, S.; and Vikalo, H. 2016. Structured low-rank matrix factorization for haplotype assembly. IEEE J Sel Top Sign Proc. 2016; 10(4):647–57.
  • [Hashemi 2018] Hashemi, A.; Zhu, B.; and Vikalo, H. 2018. Sparse tensor decomposition for haplotype assembly of diploids and polyploids. BMC Genomics, 2018; 19(4), 191.
  • [Rianne 2017] Rianne van den Berg, Thomas N. Kipf and Max Welling 2017. Graph Convolutional Matrix Completion. arXiv:1706.02263.
  • [Thomas 2016] Kipf, Thomas N. and Max Welling 2016. Variational Graph Auto-Encoders. CoRR, abs/1611.07308.
  • [Di 2014] Di, F.; Töpfer, A.; Rey, M.; Prabhakaran, S.; Duport, Y.; Leemann, C.; Schmutz, S.; Campbell, N. K.; Joos, B.; and Lecca, M. R. 2014. Full-length haplotype reconstruction to infer the structure of heterogeneous virus populations. Nucleic acids research, 42(14), e115.
  • [Ahn 2017] Ahn, S.; and Vikalo, H. 2017. aBayesQR: A bayesian method for reconstruction of viral populations characterized by low diversity. In International Conference on Research in Computational Molecular Biology, pages 353–369. Springer.
  • [Zagordi 2011] Zagordi, O.; Bhattacharya, A.; Eriksson, N.; and Beerenwinkel, N. 2011. Shorah: estimating the genetic diversity of a mixed sample from next-generation sequencing data. BMC bioinformatics, 12(1), 119.
  • [Astrovskaya 2011] Astrovskaya, I.; Tork, B.; Mangul, S.; Westbrooks, K.; Măndoiu, I.; Balfe, P.; Zelikovsky, A. 2011. Inferring viral quasispecies spectra from 454 pyrosequencing reads. BMC bioinformatics, 12(6), 1.
  • [Prosperi 2012] Prosperi, M. C; and Salemi, M. 2012. Qure: software for viral quasispecies reconstruction from next-generation sequencing data. Bioinformatics, 28(1), 132–133.
  • [Prabhakaran 2014] Prabhakaran, S.; Rey, M.; Zagordi, O.; Beerenwinkel, N.; and Roth, V. 2014. Hiv haplotype inference using a propagating dirichlet process mixture model. IEEE/ACM Trans. on Comput. Biol. Bioinform. (TCBB), 11(1), 182–191.
  • [Töpfer 2013] Töpfer, A.; Zagordi, O.; Prabhakaran, S.; Roth, V.; Halperin, E.; and Beerenwinkel, N. 2013. Probabilistic inference of viral quasispecies subject to recombination. Journal of Computational Biology, 20(2), 113–123.
  • [Socher 2011] Socher, R.; Pennington, J.; Huang, Eric H.; Ng, Andrew Y.; and Manning, Christopher D. 2011.

    Semi-supervised Recursive Autoencoders for Predicting Sentiment Distributions.

    Conference on Empirical Methods in Natural Language Processing, EMNLP(11), 151-161.
  • [Fukushima 1975] Fukushima K. C. 1975. a self-organizing multilayered neural network. Biol Cybern, 20(3-4), 121-36.
  • [Masci 2011] Masci J.; Meier U.; and Ciresan D. 2011. Stacked convolutional auto-encoders for hierarchical feature extraction.

    Artificial Neural Networks and Machine Learning – ICANN

    , 52–9.


  • [Diederik 2015] Diederik P. Kingma and Jimmy Ba 2015. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG].
  • [Xavier 2010] Xavier Glorot and Yoshua Bengio 2010. Understanding the difficulty of training deep feedforward neural networks.

    Proceedings of the thirteenth international conference on artificial intelligence and statistics

    , 249-256.
  • [Ahn 2018] Ahn, S.; Ke, Z.; and Vikalo, H. (2018). Viral quasispecies reconstruction via tensor factorization with successive read removal. Bioinformatics (Oxford, England), 34(13), i23–i31.
  • [Ahn 2017] Ahn, S.; and Vikalo, H. 2017. aBayesQR: A bayesian method for reconstruction of viral populations characterized by low diversity. In International Conference on Research in Computational Molecular Biology, pages 353–369. Springer.
  • [Motazedi 2018] Motazedi, E.; Finkers, R.; Maliepaard, C.; and de Ridder, D. 2018. Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study. Briefings in bioinformatics, 19(3), 387–403.
  • [Potato Genome Sequencing Consortium 2011] Potato Genome Sequencing Consortium 2011. Genome sequence and analysis of the tuber crop potato. Nature, 475, 189–195.
  • [Huang 2012] Huang, W.; and Li, L.; Myers, J. R. and Marth, G. T. 2012. ART: a next-generation sequencing read simulator. Bioinformatics, 28(4), 593–594.
  • [Li 2009] Li, H.; and Durbin, R. 2009. Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics, 25(14), 1754–1760.
  • [Edge 2017] Edge, P.; Bafna, V.; and Bansal, V. 2017. Hapcut2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res., 2017; 27(5):801–12.