Tumors contain multiple, genetically diverse subclonal populations of cells that have evolved from a single progenitor population through successive waves of expansion and selection [nowell1976, Gerlinger12, hughes2014clonal]. Reconstructing their evolutionary histories can help identify characteristic driver mutations associated with cancer development and progression [Hanahan00, Hanahan11]; and can provide insight into how tumors might respond to treatment [aparicio2013implications, bedard2013tumour]. In some cases, it is possible to genotype the subpopulations present in a tumor, while reconstructing its history, using the population frequencies of mutations that distinguish these subclonal populations [Mullighan08, Navin10, Marusyk10, Gerlinger12, Schuh12, Shah12, Carter13, Landau13, theta, trap, phylosub, pyclone, expands, somatica, purbayes]. Increasingly, tumors are being characterized using whole genome sequencing (WGS) of bulk tumor samples [pcawg] and few automated methods exist to reliably perform this reconstruction on the basis of these data.
Subclonal reconstruction algorithms attempt to infer the population structure of heterogeneous tumors based on the measured variant allelic frequency (VAF) of their somatic mutations. Some methods perform this reconstruction based solely on single nucleotide variants or small indels (collectively known as simple somatic mutations or SSMs) [trap, pyclone, phylosub, expands, ding2012clonal, purbayes]. Others use changes in read coverage to identify genomic regions with an average ploidy that differs from normal which they explain using inferred copy number variations (CNVs) affecting some of the cells in the sample [theta, absolute, cnanorm, somatica].
The low read depth of current WGS complicates subclonal reconstruction. Until recently, subclonal populations (i.e., subpopulations
) were defined based on accurate estimates of the proportion of cells with each mutation (i.e., theirpopulation frequency) which, for individual SSMs, are only available through targeted resequencing where the read depths are orders of magnitude higher than typical WGS depths[phylosub, pyclone, ding2012clonal]. However, preliminary evidence suggests that the much larger number of mutations detected by WGS can compensate for their decreased read depth [nik12]. In contrast, CNVs affect large, multi-kilobase or megabase-sized regions of the genome allowing the average copy number of these regions to be accurately estimated with WGS. Unfortunately, CNV-based subclonal reconstruction is more difficult than SSM-based reconstruction because of the need to simultaneously estimate population frequency and new copy number for each CNV. Most CNV-based methods only attempting to infer the copy number status of the clonal cancerous population [absolute, cnanorm] that contains the mutations shared by all of the cancerous cells. The few CNV-based methods [theta, somatica] that attempt to resolve more than one cancerous subpopulation are practically limited to a small number (often two) of subpopulations. In contrast, SSM-based methods applied to targeted resequencing data can reliably resolve many more cancerous subpopulations [phylosub, pyclone, trap, ding2012clonal]. However, it remains unclear what the limits of WGS-based automated subclonal reconstruction are.
Another open question is how to combine CNVs and SSMs when doing reconstruction. CNVs overlapping SSMs can interfere with SSM-based reconstruction because they complicate the relationship between VAF and population frequency. Although some methods attempt to model the impact of CNVs on the allele frequency of overlapping SSMs [expands, pyclone, phylosub, cloneHD], these methods have significant restrictions. For example, several of these methods [pyclone, phylosub] make the unrealistic assumption that every cell either contains the structural variation and the mutation or neither. Also, no method places structural variations in a phylogenetic tree, which is important for studying the evolution of cancerous genomes.
We describe PhyloWGS, the first method designed for complete subclonal phylogenic reconstruction of both CNVs and SSMs from whole genome sequencing of bulk tumor samples. Unlike all previous methods, PhyloWGS appropriately corrects SSM population frequencies in regions overlapping CNVs and is fast enough to perform reconstruction of at least five cancerous subpopulations based on thousands of mutations. We present results on subclonal reconstruction problems that cannot be correctly reconstructed using previous methods. We also probe the relationship between WGS read depth and the number of subpopulations that PhyloWGS can recover. Finally, we demonstrate that even in the absence of reliable CNV estimates, it is still feasible to perform automated subclonal composition reconstruction based on SSM frequency data at typical WGS read depths (30-50x), even for highly rearranged genomes where less than of the SSMs lie in regions of normal copy number.
2 Previous work
Figure 1 provides an overview of an evolving tumor, measuring of VAFs, and the resulting subclonal reconstruction process.
Panel (i) of this figure shows a visualization of the evolution of a tumor over time as noncancerous cells (subpopulation A, shown in grey) are replaced by, at first, one clonal cancerous population (subpopulation B, shown in green) which then further develops into multiple cancerous subpopulations (C and D, shown in blue and yellow, respectively). Tumor cells define new subpopulations by acquiring new oncogenic mutations that allow their descendants to expand relative to the other tumor subpopulations. Each circle in Panel (i) refers to a subpopulation. We associate subpopulations with the set of shared somatic mutations that distinguish it from its parent subpopulation (or, in the case of A, from the germline (or reference) genome); this mutation set is indicated by the corresponding lower case letter (e.g. mutation set b first appears in subpopulation B). However, each subpopulation also inherits all of its parent’s mutations; the subclonal lineage of a mutation is the set of all subpopulations that contain it (e.g., the subclonal lineage of a is A, B, C, and D).
In general, the subpopulation-defining mutation sets include more than one mutation. Cancerous cells often have increased mutation rates, and even noncancerous cells accumulate somatic mutations at a rate of per cell division [behjati2014genome]. As such, subpopulations are defined not only by the small number of oncogenic ‘driver’ mutations that support rapid expansion but also a larger number of ‘passenger’ mutations acquired before the driver mutation(s). The selective sweeps that cause subpopulation expansion increase the population frequency of both driver and passenger SSMs, driving them to having indistinguishable population frequencies [burrell2013causes, klein2013selection]. However, sampling and technical noise in sequencing means that the observed VAFs are distributed around the true value for a subpopulation. Panel (ii) shows an example histogram of measured VAFs for SSMs found in a heterogeneous tumor sample.
Subclonal reconstruction algorithms define mutation sets, and their associated subpopulations, by analyzing the population frequencies of somatic mutations detected in a tumor sample. In Figure 1, all mutations are SSMs, and all SSMs occur on one copy in diploid regions of the genome. In this case, the estimated population frequency of an SSM is simply twice its VAF. Figure 2, discussed in the next section, shows how CNVs overlapping SSM loci change this relationship. Note that although each VAF cluster corresponds to a subclonal lineage, and a subpopulation that was present at some point during the tumor’s evolution, this subpopulation need not be present when the tumor was sampled. In Figure 1, subpopulation B is no longer present in the tumor, although its two descendant subpopulations are. These vestigial VAF clusters, if they exist, always correspond to subpopulations at branchpoints in the phylogeny, however, not every branchpoint generates a vestigial cluster.
2.1 SSM-based approaches
SSM-based subclonal reconstruction algorithms attempt to reconstruct the subpopulation genotypes based on VAF clusters (and their associated mutation sets) identified by fitting statistical mixture models to the VAF data either without phylogenic reconstruction [pyclone, sciclone, expands, purbayes], before phylogenic reconstruction[recBTP] or concurrently with it[trap, phylosub]. Often, as in Figure 1
, the clusters are overlapping which introduces uncertainty in the exact number of mutation sets represented in the tumor (as well as in the assignment of SSMs to clusters). Adding more clusters to the model always provides better data fit, so to prevent overfitting, the cluster number is selected by balancing data fit versus a complexity penalty (e.g. the Bayesian Information Criteria) or by Bayesian inference in a non-parametric model[pyclone, phylosub, sciclone].
In panel (iii) in Figure 1 the correct number of clusters have been recovered along with appropriate central VAFs.
Assuming that the correct VAF clusters can be recovered, the subclonal lineages corresponding to each mutation set must still be defined. Defining the subclonal lineages is equivalent to defining the tumor phylogeny; and often multiple phylogenies are consistent with the recovered VAF clusters (e.g. panel (iiii) in Figure 1). Complete and correct reconstruction of subpopulation genotypes requires resolving this ambiguity. To do so, reconstruction methods make one of a handful of assumptions about the process of tumor evolution.
A common, and powerful, assumption is the infinite sites assumption (ISA) [kimura69, hudson83, phylosub] which posits that each SSM only occurs once in the evolutionary history of the tumor. The ISA implies that the tumor evolution is consistent with a ‘perfect and persistent phylogeny’ [pyclone]: each subpopulation has all of the SSMs that its ancestors had; each SSM appears in only one subclonal lineage; and each subclonal lineage corresponds to a subtree in the phylogeny of tumor subpopulations. Because SSMs are relatively rare (compared to the genome size), the ISA is nearly always valid for all SSMs. So there is little danger of incorrect reconstructions due to violations of the ISA. In many cases, the ISA alone permits the recovery of multiple, complete subpopulation genotypes from a single or small number of tumor samples using either the sum rule [phylosub] (also called the pigeonhole principle [nik12]) or the crossing rule [phylosub], respectively. Methods that do not use the ISA require, in the case of no measurement noise, at least as many tumor samples as there are subpopulations [trap, clomial]; in actual application when there is noise, even more samples are required.
Unfortunately, the ISA alone is often unable to fully resolve reconstruction ambiguity. As such, some methods [recBTP, trap] also make a sparsity assumption to select among ISA-respecting phylogenies consistent with the VAF frequency data. This assumption, which we call strong parsimony, posits that due to expansion dynamics, there are a small number of subpopulations still present in the tumor [recBTP, trap], and that many of the VAF clusters are vestigial. These methods therefore select the phylogeny (or phylogenies) that maximize the number of vestigial VAF clusters [trap], or equivalently, the number of branchpoints where the parental subpopulation has a zero frequency in the current tumor[recBTP, trap]. The strong parsimony assumption does resolve some ambiguity, and leads to the correct reconstruction in Figure 1, but it is risky as its empirical validity is not yet established. For example, under some conditions, a linear (i.e. non-branching) phylogeny can be mistaken for a branching one; the risk of this occurring increases as the VAF measurement noise or the number of subpopulations in the tumor increases. This background distribution of false positive vestigiality is not yet considered by either of the methods that assume strong parsimony.
By assigning all SSMs within a VAF cluster to the same mutation set; reconstruction methods make another implicit assumption which we call weak parsimony. This assumption does not hold if two mutation sets have the same population frequency. Note that if the ISA is valid, by the pigeonhole principle, weak parsimony is guaranteed to be valid whenever the population frequency of the mutation set is .
Table 1classifies reconstruction methods based on these assumptions, whether they recover complete subpopulation genotypes (or simply identify subclonal lineages), and whether they can handle single tumor samples, multiple tumor samples, or both.
|PropertyMethod||PhyloWGS||PhyloSub [phylosub]||THetA [theta]||PyClone [pyclone]||TrAp [trap]||Clomial [clomial]||RecBTP [recBTP]|
PhyloWGS, like its predecessor PhyloSub[phylosub]
, does not make the strong parsimony assumption nor does it only report a single tree. Instead it reports samples from the posterior distribution over phylogenies. Because the clustering of the VAF frequencies is performed concurrently with phylogenic reconstruction, PhyloWGS is able to perform accurate reconstruction even when the weak parsimony assumption is violated in a strict subset of the samples available. For example, if the VAF clusters overlap in one sample but not another. Our Markov Chain Monte Carlo (MCMC) procedure samples phylogenies from the model posterior that are consistent with the mutation frequencies and does not rule out phylogenies that are equally consistent with the data. From this collection of samples, areas of certainty and uncertainty in the reconstruction can be determined.
2.2 CNV-based approaches
There are three major differences between CNV-based reconstruction and SSM-based reconstruction. First, because large regions of the genome are affected by CNVs and reads mapping across the regions can be used to estimate average ploidy, accurate quantification of changes in average copy number can be achieved with much smaller read depths (as low as 5-7x) [bicseq, theta]. However, the other two differences make CNV-based subclonal reconstruction more difficult and less generally applicable compared with SSM-based methods. One difference is that the ISA is often invalid because CNVs affect large regions of the genome. As such, it is more common to see overlapping mutations occurring in independent cells; these make the reconstruction problem more challenging. Even in the circumstance that only one CNV affects a given region, inferring its population frequency is still challenging because at least two values, population frequency and non-negative integer copy number , have to be simultaneously inferred from a single observed, non-normal average copy number . In particular, this equation,
always has at least two different solutions for .
In absence of other information, like B-allele frequencies [nik12], parsimony assumptions are relied upon to resolve reconstruction ambiguities. One strategy only attempts to reconstruct the cancerous, subclonal lineage [absolute, cnanorm] with the highest population frequency (also known as the ‘clonal population’). From this reconstruction, the proportion of cells in the tumor sample that are cancerous (i.e. the cellularity), as well as detecting CNVs that are shared by all cancerous cells in the tumor can be derived. However, this approach can fail when there are multiple subclonal populations, especially if they share few CNVs [theta, somatica]. Methods that attempt to detect cancerous subpopulation do so by balancing data fit with a complexity term that penalizes additional subpopulations[theta, somatica]. So far, these methods seem to be practically limited to a small number of cancerous subpopulations (i.e., ); and cannot be applied to tumors with substantial rearrangements.
2.3 Combining SSMs and CNVs
In loci affected by CNVs, computing the population frequency of an SSM from its VAF requires knowing whether the SSM occurred before, after or independently of the CNV. If the SSM occurred before the CNV, and CNV affects the copy number of the SSM, then computing its VAF also requires knowing the new number of maternal and paternal copies of the locus. Figure 2 illustrates a situation where incorporating CNV information is critical for subclonal reconstruction. Without CNV information, the two VAF peaks would be interpreted as two separate subclonal lineages. With CNVs, it becomes clear that the second peak is caused by the amplification of part of the genome that increases the VAF of all SSMs found in the region.
Some subclonal reconstruction methods simply ignore the impact that CNVs have on the relationship between SSM population and allele frequency [trap, purbayes]. Other methods that do account for the effect of copy number changes on SSM frequencies [pyclone, phylosub, expands] by integrating over all the possible relationships between allele frequency and population frequency without making use of the fact that the infinite sites assumption, which was necessary to uniquely associate SSMs to subclonal lineages in the first place, also constrains this relationship [nik12].
For the first time, we describe an automated method, PhyloWGS, that performs subclonal reconstruction using both CNVs and SSMs. By combining information from both CNVs and SSMs, and properly accounting for their interaction, we provide a more comprehensive and accurate description of subclonal genotype.
In the following, we first provide a brief explanation of how PhyloWGS incorporates both SSMs and CNVs in phylogenic reconstruction by converting CNVs into pseudo-SSMs and performing subclonal reconstruction on the SSMs and pseudo-SSMs; full details are provided in the Methods section. Then, we show an illustrative example where accounting for the effect of CNVs on SSMs permits the correct subclonal reconstruction of a tumor population whereas using either CNV or SSM data in isolation does not. Then, we describe our efforts to quantify the relationship between read depth and the number of subpopulations that can be accurately recovered by applying PhyloWGS to simulated WGS data of different read depths, number of subpopulations, and SSMs. Next, we describe the application of PhyloWGS to three TCGA benchmark datasets. Finally, we describe the application of PhyloWGS to two real datasets: a multiple-sample WGS data from a patient with chronic lymphocytic leukemia and a single sample from a breast tumor.
3.1 Incorporating CNVs with SSMs in phylogenic reconstruction
We assume that a CNV algorithm has already been applied to the sequencing data and that this algorithm provides estimates of copy number and population frequency for each CNV . We use these estimates in two ways: first, for each CNV, we create an equivalent pseudo-SSM with population frequency by adding an SSM to the dataset with total reads and variant reads rounded to the nearest whole number (i.e., the expected number of variant reads of a heterozygous mutation with population frequency ) where
is set to a user-defined multiple of the average WGS read-depth. If a confidence interval foris available, we can set to have the same confidence interval. Note, we allow multiple CNVs to affect the same locus; each of these CNVs is assigned its own pseudo-SSM.
Second, we associate all SSMs within the region affected by the CNV to this pseudo-SSM. Our model (described in the Methods section) uses this association to compute the transformation from the inferred population frequency of an SSM to its expected VAF.
Here, we briefly describe how this transformation when there is only one CNV affecting the SSM locus; the Methods section describes the general version of the transformation, used by PhyloWGS, that allows multiple CNVs to affect the locus.
Given the population frequency of the CNV, , the copy number of the CNV broken down into maternal and paternal components , and the population frequency of the SSM, , the equations below compute the expected allele frequency of the SSM . Here we are using the terms ‘maternal’ and ‘paternal’ simply to distinguish the two copies and not to suggest that we have actually assigned each chromosome to each parent. Furthermore, the description here assumes that the SSM is on the maternal copy, if it is on the paternal copy, replace with below.
If a SSM lies in a region affected by a CNV, there are three possibilities for their phylogenic relationship:
SSM precedes the CNV event, i.e., the CNV occurred in a cell already containing the SSM
SSM occurs after the CNV event, i.e., the SSM occurred in a cell already containing the CNV
SSM and CNV occurred in separate branches of phylogeny, i.e., the mutations occur in separate cells and no cell contains both the SSM and the CNV.
Case 1 (SSM Cnv)
Because of the infinite sites assumptions, this phylogenic relationship requires . In this case, cells with the CNV contain copies of the SSM and cells with the SSM but not the CNV only have one mutated copy. As such, the expected allele frequency can be written as:
The numerator corresponds to the average number of copies per cell of the SSM-mutated locus in the population and the denominator is the average number of copies per cell of the locus (mutated or not) in the population. We note that if there is no copy number change in then the numerator is simply ; and if then the numerator is .
Case 2 (CNV Ssm)
This case is only possible if the maternal locus still exists after the CNV (i.e. ), and furthermore that . By the infinite sites assumption, only one copy of the locus is affected, so the numerator is simply and we do not need to know the breakdown of into and . As such:
Case 3 ()
In this case, the SSM and CNVs lie on different branches of the phylogeny and no cell in the population contains both mutations, so the only constraints on and are that . As per Case 2, the average number of loci affected by the SSM is . So the expected allele frequency is identical to case 2.
We illustrate some of the ways how the relationship between a CNV and an affected SSM in the phylogenetic tree affects the observed VAF of that SSM in Figure 3.
We note that the breakdown of into and and phasing the SSM only affects the expected variant allele frequency in Case 1. This is because it is the only case where a CNV event can affect a mutated locus. Although PhyloWGS requires the breakdown of into and under these conditions, we do not require the SSM to be phased, as many cannot be[nik12], and instead consider both possibilities when computing the likelihood. Some subclonal copy number callers decompose into and [titan]; if the caller does not provide this decomposition, then PhyloWGS should be run on loci where .
An important consequence of these rules is that under some conditions, it is possible to unambiguously identify a branching phylogeny using single-sample data. If an SSM can be phased to an amplified locus there are situations where given particular values of , , and one can distinguish between Case 1 and Case 3. For example, given , for Case 3 the inferred is . However, if Case 1 were true the resulting inferred would be negative and so Case 1 is not possible. This condition holds whenever . We were unable to find any other circumstances in which single sample VAFs were more consistent with a branching phylogeny than a chain phylogeny.
3.2 The combination of CNVs with SSMs is required for accurate subclonal reconstruction
Consider a tumor where 25% of the cells are normal (no SSMs and diploid, population A), 25% come from a subpopulation with only SSMs (SSM1-4, population B) and 50% belong to a descendant subpopulation of B containing all the SSMs from B and adding new simple somatic variants (SSM5-8) and a homozygous deletion (CNV1) in the region containing SSM4, labelled population C. The evolutionary tree of this population is shown in Figure 4A. In reads sampled from this population, the expected variant allele frequencies for SSM1-3 are 37.5% (i.e. half of their population frequency) and for SSM5-8 they are 25%; however based on the rules described in the Methods section, the expected variant allele frequency of SSM4 is 25%. This is because all the copies of the genome at that position come from population A or B. Population A and B are present in equal proportions and only one copy in population B contains variant reads, so 25% of the genomes contain the variant allele. As such, methods that do not incorporate the CNV change at the SSM4 locus will incorrectly assign SSM4 to population C. Also, methods that incorporate only CNV information cannot detect the subpopulation B which is defined by SSM alone.
We generated simulated variant and reference allele counts for this example at a simulated read-depth of 60. A table containing the reference and total read counts for each SSM can be found as Table 2. PhyloWGS was able to correctly reconstruct the evolutionary history and subpopulation structure (Figure 4B). However, a version of PhyloSub which ignored CNVs incorrectly assigned SSM4 to population C (Figure 4C). Furthermore, by construction, there is no way to recover population B based only on CNV data, so a perfect CNV-based algorithm would infer the subclonal structure in Figure 4D.
|mutation id||reference counts||total counts|
We also ran PyClone [pyclone] on this dataset. PyClone cannot take as input the fact that a locus has been homozygously deleted, so we ran PyClone either by telling it there were no CNV changes or that there was a deletion of one copy. Without any input CNV alterations, PyClone produced a clustering identical to PhyloSub, while in the single copy deletion state, PyClone placed SSM4 in an additional cluster with no other mutations. As this simple example illustrates, integrating data from both SSMs and CNVs is required for full, and accurate, subclonal reconstruction.
3.3 Applying PhyloWGS to simulated data
An important question in subclonal analysis of tumor samples is estimating how deep sequencing must be in order to recover the subclonal structure. To answer this question we applied PhyloWGS to simulated read counts with known subclonal structure. Our simulations looked at a range of total population counts (3, 4, 5, 6), read depths (20, 30, 50, 70, 100, 200, 300) and number of SSMs per population (5, 10, 25, 50, 100, 200, 500, 1000). For each combination of population count, read depth and SSMs per population we generated simulated tumor data for which the subclonal population frequencies were consistent with both branching and linear phylogenies. For each simulated SSM in subpopulation , reference allele reads () were drawn as:
where is the clonal frequency of population and is the simulated read depth. A table of the values used for the simulations can be found as Table 3.
|Number of populations||values used (linear)|
|4||0.56, 0.25, 0.06|
|5||0.64, 0.36, 0.16, 0.04|
|6||0.71, 0.44, 0.25, 0.11, 0.03|
First, we examined the time needed to complete sampling as a function of the number of SSMs (shown in Figure 5).
In less than three hours on a single core of an Intel i7-4770K, on average, inference could be completed with up to 1000 SSMs (all timing data shown uses the simulated dataset with 5 subpopulations).
To determine the number of subpopulations our algorithm found, we analyzed the sampled tree with the highest complete data likelihood and removed any subpopulations with zero assigned SSMs. We then compared the difference between the number of subpopulations used to generate the data and the number of subpopulations identified by our algorithm. The results of this comparison for ambiguous phylogeny simulations are shown in Figure 6.
Several relationships between simulation parameters and the output of our model can be observed. First, unsurprisingly, increasing the read depth and decreasing the number of subpopulations resulted in increased accuracy in the estimated number of subpopulations. Second, for the ambiguous phylogeny simulations, there is a U-shaped relationship between accuracy and the number of SSMs characterizing each population, where accuracy first increases and then decreases as the number of SSMs increases. This decrease in accuracy with high numbers of SSMs is non-intuitive, as more SSMs provide more information with which to perform inference. However, the Dirichlet process prior sometimes overestimates the number of source components [millerharrison2013]. While this overestimation has not been demonstrated for the tree-structured stick-breaking process used by PhyloWGS, the similarity between the processes makes it likely that this is the case. While some of these errors can be eliminated by ad-hoc removal of clusters with a small number of SSMs, there is not yet a consistent approach to do this, so we leave the results untouched. These results suggest that for three or four subpopulations, a read depth consistent with typical WGS experiments (20x–30x) is sufficient to identify the correct number of subpopulations, while experiments with 200x-300x are needed to resolve tumors with up to six subpopulations.
Another important measure of the performance of our algorithm is how accurately the mapping from population to SSM is. To evaluate this accuracy in a systematic way that accounts for class-imbalance, varying number of SSMs and differing number of clusters we examine the Area Under the Precision-Recall Curve (AUPRC) between the known true co-clustering matrix and the average co-clustering matrix from our samples. A co-clustering matrix is a binary matrix where if SSM and SSM are in the same cluster. The average co-clustering matrix is constructed by taking the average of the co-clustering matrices of each sample in the Markov chain after burn-in and is an estimate of the posterior mean co-clustering matrix of our model. The average co-clustering matrix better predicts the true co-clustering matrix than the co-clustering matrix computed from the maximum data likelihood tree. AUPRC was chosen over Area Under the Receiver-Operator Curve (AUROC) as it is known to be more informative in the presence of class-imbalance [davis2006relationship] which changes as the number of populations increase.
In Figure 7 we plot the resulting AUPRC for our simulation experiments.
As with inferring the number of populations, our method does better as the read depth increases and the number of populations decreases. Unlike the last result, there is no clear relationship between the number of SSMs and the resulting AUPRC. To provide qualitative guidance to users of the meaning of various AUPRC cutoffs, we show several examples of inferred co-clustering matrices with AUPRCs of 0.65, 0.8, 0.9 and 0.98 in Figure 8.
3.4 Simulations with CNV changes
Next, we generated simulated data from a more complex genetic environment. In these cases we simulated data from a tumor with 20% normal tissue, a 40% CNV-free subpopulation with 500 mutations and a descendent subpopulation with another 200 mutations but a substantial CNV affecting 50% of the genome, either an amplification or a deletion. We simulated data with read depth 20, 30, 50, 70, 100, 200 and 300 10 times for each read depth / alteration pair. We then applied PhyloWGS and computed the AUPRC scores. To demonstrate the importance of incorporating CNVs in phylogenetic reconstruction we compared the scores from PhyloWGS with those from PyClone [pyclone]. Performance for both methods can be seen in Figure 9.
Using PhyloWGS results in superior clustering when compared to PyClone for both subclonal amplifications and deletions, with the exception of amplifications with low read-depths, where the performance distributions closely overlap.
3.5 TCGA Benchmark
Next, we apply PhyloWGS to the TCGA variant-calling benchmark 4 dataset [tcga]. The samples we examined consist of a normal population, a cancerous cell-line population HCC 1143 and a spiked-in subclonal descendant of the cancerous population in various proportions with 30x coverage. Starting with the publicly available BAM files we identified locations of possible structural variation using BIC-seq [bicseq] with default parameters, except for the bandwidth parameter, which was set to 1000. We changed the bandwidth parameter because we found the default value of 100 resulted in overly noisy segmentations and highly variable normalized read counts. To identify SSMs and the number of variant and reference reads for each SSM, we reverted the BAM files into unaligned reads using Picard 1.90 [picard]. Reads for each sample were then realigned using BWA 0.6.2 [bwa] and collapsed using Picard. Aligned reads of a cancerous sample and its matched normal were analyzed by two somatic calling tools: MuTect 1.1.4 [mutect] and Strelka 1.0.7 [strelka]. A set of high confidence mutations were extracted by taking an intersection of the calls made by MuTect and Strelka. Previous verification on other tumor/normal pairs showed that this approach achieved precision (data not shown). We first ran THetA [theta] using the output of BIC-seq with the aim of using THetA’s output to provide us with the CNV information that PhyloWGS requires (see Methods section). However, despite the fact that the subclonal population varied from 40% to 10%, THetA returned nearly identical composition inferences for all the samples (see Figure 10). Because of this, we decided that we could not rely on THetA’s copy number calls, we instead simply removed all SSMs in a location where BIC-seq identified possible structural variation. This eliminated most of the SSMs identified, leaving only 62 SSMs from the original 4,344. Despite this small number of SSMs our algorithm was still able to identify the correct number of populations and captured the changing composition of the samples. Also, the inferred SSM content of each cluster was identical in the three separate runs.
3.6 Chronic Lymphocytic Leukemia
Next, we applied PhyloWGS to a data from patient CLL077 extracted from Supplementary Table 7 from a paper describing a Chronic Lymphocytic Leukemia (CLL) dataset[Schuh12]. For this patient, five tumor samples were collected over the course of treatment. We note that our method does not assume or use any temporal relationships in multiple sample data and could equally be applied to multiple samples collected simultaneously. We have previously reported experiments using the targeted resequencing data with average read depth of 100,000x at 17 identified SSMs [phylosub], we now instead use the data from WGS for that same set of mutations, with average read depth of 40x. By examining the number of reference and variant alleles it was clear that the mutation in gene SAMHD1 was at a location that was homozygous in the cancerous subpopulation it was part of. This is because the proportion of variant reads was far above 50% (the expected variant allele proportion for a heterozygous SSM present in every cell of the sample). We simulated the data that a CNA algorithm would find by assuming that the copy number at that location was one in a CNA-defined subpopulation and that the proportion of cells in that population was the same as implied by halving the proportion of variant alleles. After running PhyloWGS on this data, we compare the maximum data likelihood tree with the expert-generated tree found using a semi-manual method and targeted resequencing data in Figure 11. The two trees are nearly identical with the exception of assigning a single SSM to a child of the subpopulation where it is found in the expert tree.
3.7 Breast Tumor
We analyzed data from WGS at 288x coverage from tumor PD4120a, first reported in [nik12] and re-analyzed in [theta]. We confined our analysis to SSMs in genomic regions where THetA and the original analysis agreed on the copy number status of the genome (chr 3,4q,5,10,13,16q,17,19 and 20). These regions contains a total of 26,029 SSMs, of which 4,739 were in regions affected by clonal copy number changes and 2,171 were in regions affected by subclonal copy number changes. We then ran PhyloWGS, PyClone and SciClone on SSMs in regions of normal copy number and on SSMs in both altered and normal copy number. PyClone uses a non-phylogenic correction for copy number alterations and SciClone performs no correction. Based on the semi-manual clustering from [nik12], we identified those mutations assigned to clusters D,C and B with high probability to use as our gold standard for clustering. We then compared AUPR for all three algorithms on the two datasets (see Figure 12). All three algorithms have very similar performance when only looking at SSMs in normal regions (Fig. 12 left panel). PhyloWGS continues to have very high performance when SSMs in regions of copy number alterations are included, while both PyClone and SciClone have much worse performance than PhyloWGS (Fig. 12 right panel).
Our work makes two important contributions to the burgeoning field of subclonal reconstruction. First, we provide the first automated method that integrates SSM and CNV data in the reconstruction of tumor phylogenies. This is an important innovation, previous methods either ignore the impact that CNVs have on SSM allele frequencies [trap, purbayes], or assume that the CNVs affect all (and only) the cells that contain the SSM [pyclone, phylosub, expands]. These assumptions can lead to incorrect inferences about the population frequency of SSMs because how a CNV affects the allele frequency of an SSM depends on its phylogenic relationship with the SSM. Many of the insights about how to integrate SSM and CNV data appear in [nik12]; our work here extends and formalizes these seminal observations while also providing an automated method for phylogenic reconstruction. A further advantage of combining SSMs and CNVs in the phylogenic reconstruction is that CNVs overlapping the SSM locus can provide further constraints on the tree-structure than are provided by SSM frequency alone, and we described one case where it is possible to unambiguously infer branching when an amplification of a SSM-containing locus does not lead to a large increase in the SSM allele frequency. Second, we show that given typical WGS read depths, SSM-based methods are able to accurately reconstruct tumor phylogenies and detect and assign SSMs for at least six subpopulations. Previously it wasn’t clear to what extent this reconstruction is possible; and no automated reconstructions with more than two cancerous subpopulations based on WGS data had been described. Furthermore, we demonstrate the importance of phylogenetic correction of VAFs of SSMs that occur in loci affected by copy number changes when performing subclonal reconstruction. Specifically, we presented results on a breast cancer benchmkar where methods that do not use PhyloWGS’s phylogenic correction perform much worse at recovering subpopulations. Finally, we report examples of accurate subclonal reconstruction for cancer populations with highly reordered chromosomes solely on the basis of SSM frequencies in the regions of normal copy number. On these same data, a state-of-the-art CNV-based method failed to perform the reconstruction.
The current version of PhyloWGS relies on preprocessing the sequencing data with a CNV-based method for subclonal reconstruction. This is because it assumes that initial population frequency and copy number data are already available for the CNVs; furthermore, for amplifications, , it requires to be separated into the relative number of each of the two copies, i.e., . It does not, however, require the SSMs to be phased; in other words, it does not need to know whether the SSMs occurred on the maternal or paternal copy of the chromosome. New CNV-based methods provide [titan]; our work anticipates further developments in subclonal CNV callers. If used with a method that cannot decompose copy number changes into changes in the maternal and paternal loci, PhyloWGS can be restricted to regions of copy number loss (i.e. ), where there is only one possible breakdown. Note that this decomposition is only necessary when an SSM precedes a CNV on the same branch of a phylogeny.
We also note that PhyloWGS does not require the CNV-based preprocessing to be able to detect all of the subclonal populations, and we have shown that PhyloWGS can detect additional populations either defined completely by SSMs or that were not detected by CNV-based methods. This is particularly important because recent CNV-based methods are limited to a maximum of two cancerous populations and those that can detect cancerous subpopulation do so by relying on a strong parsimony assumption. If invalid, this assumption can lead to large errors in subclonal reconstruction because it selects branching phylogenies over chain phylogenies that are equally well-supported by the data.
Indeed our results suggest an alternative strategy for combining SSMs and CNVs in subclonal reconstruction. Regions unaffected by CNVs can be relatively easily detected using methods such as BIC-seq [bicseq]. Even in highly rearranged cancer genomes, there are often non-negligible regions of normal copy number and we have shown that we can achieve reasonably accurate subclonal reconstructions using the limited number of SSMs in regions of normal copy number in the TCGA benchmark.
In the final stages of preparing this manuscript, a new method, cloneHD [cloneHD] was published. Like PhyloWGS, this method combines both SSMs and CNVs in subclonal reconstruction and does so using WGS data from single and multiple samples. However, unlike PhyloWGS, cloneHD does not explicitly perform phylogenic reconstruction, so it is unable to fully account for the phylogenic relationship among SSMs and CNVs when analyzing SSM allele frequency. As such, it is not clear to us that it can correctly solve the subclonal reconstruction problem posed in Figure 1. The cloneHD manuscript also does not extend the limits of WGS-based subclonal reconstruction as none of the examples reconstruct more than two cancerous subpopulations from a single sample. Finally, cloneHD appears to rely on the strong parsimony assumption in order to assess subclonal genotypes, and only reports a single reconstruction, obscuring the uncertainty involved. However, cloneHD does appear to be an interesting and powerful method and we hope that future work can compare the merits and drawbacks of these alternate approaches to subclonal reconstruction.
We have presented a new method, PhyloWGS, that reconstructs tumor phylogenies and characterizes the subclonal populations present in a tumor sample using both SSMs and CNVs. Our method takes as input measures of allelic frequency of SSMs, as well as estimates of population frequencies and copy number for each CNV. PhyloWGS groups the SSMs and CNVs into subpopulations, and estimates the population frequencies and the phylogenic relationship of these subpopulations. PhyloWGS is based on a generative probabilistic model of allele frequencies that incorporates a nonparametric Bayesian prior over trees. The output of PhyloWGS consists of samples from this distribution generated through Markov Chain Monte Carlo and we report the tree that maximizes the likelihood of the data found during the sampling run, if a single point estimate is required. However, unlike our previous PhyloSub method [phylosub], PhyloWGS includes CNVs in its subclonal reconstruction and, in doing so, can correctly account for the effect of CNVs on the VAF of overlapping SSMs. PhyloWGS also runs more than 50 times faster than PhyloSub, making it feasible to apply it to the thousands of SSMs that are found through WGS.
We have applied PhyloWGS to real and simulated data from WGS of tumor samples to demonstrate that subclonal populations can be reliably reconstructed based solely on SSMs from medium depth sequencing (30x-50x). We have also used PhyloWGS to correctly solve a simulated subclonal reconstruction problem that neither an SSM-based nor CNV-based method can solve alone; and to reconstruct the phylogeny and subclonal composition of a highly rearranged sample for which a CNV-based method fails. We also demonstrate that when applied to WGS of time-series samples from a chronic lymphocytic leukemia patient, PhyloWGS recovers the same tumor phylogeny previously reconstructed by applying PhyloSub and a semi-manual method to data from deep targeted resequencing. Finally, we demonstrate state-of-the-art performance in subclonal reconstruction on a breast tumor sample, highlighting the advantages of performoing phylogenetic CNV correction. Our work thus greatly expands the range of tumor samples for which subclonal reconstruction is possible, enabling widespread use of automated subclonal reconstruction for medium-depth WGS sequencing experiments.
6.1 PhyloSub model
Our probabilistic model for read count data is based on PhyloSub [phylosub]. For each SSM that is detected by high-throughput sequencing methods, cells containing the SSM are referred to as the variant population and those without the variant as the reference population. Let and denote the number of reads matching the reference allele A and the variant allele B respectively at position , and let . Let denote the probability of sampling a reference allele from the reference population. This value depends on the error rate of the sequencer. Let denote the probability of sampling a reference allele from the variant population which is set to 0.5 if there are no CNVs – in other words, the SSM is assumed to only affect one of the two chromosomal locations. Let denote the fraction of cells from the variant population, i.e., the SSM population frequency at position , and denote the fraction of cells from the reference population at position . Let denote the Dirichlet process (DP) prior with base distribution and concentration parameter . Samples from the DP are used to generate the SSM population frequencies . The observation model for allelic counts has the following generative process:
The posterior distribution of is .
The Dirichlet process prior in the observation model of allelic counts (1) is useful to infer groups of mutations that occur at the same SSM population frequency [Shah12, pyclone]. Furthermore, being a nonparametric prior, it allows us to avoid the problem of selecting the number of groups of mutations a priori. However, it cannot be used to model the clonal evolutionary structure which takes the form of a rooted tree. In order to model this, we use the tree-structured stick-breaking process prior [AdamsGJ10] denoted by . The parameters and influence the height and width of the tree respectively and are similar to the concentration parameter in the Dirichlet process prior. Let denote the set of unique frequencies in the set , where is the number of subclones or nodes in the tree. In other words, multiple elements in the set will take on the same value from the set of unique frequencies. The prior/base distribution
of the SSM population frequencies is the uniform distributionfor the root node and for any other node in the tree, where denotes the parent node of and is the set of siblings of . This ensures that the clonal evolutionary constraints (discussed below) are satisfied when adding a new node in the tree. Given this model and a set of observations , the tree structure and the SSM population frequencies are inferred using Markov Chain Monte Carlo (MCMC) sampling (see PhyloSub [phylosub] for further details).
Given the current state of the tree structure, we sample SSM population frequencies in such a way that the SSM population frequency of every non-leaf node in the tree is greater than or equal to the sum of the SSM population frequencies of its children. To enforce this constraint, we introduce a set of auxiliary variables , one for each node, that satisfy , and rewrite the observation model for allelic counts (1) explicitly in terms of these variables resulting in the following posterior distribution:
where we have used to denote the auxiliary variables for each SSM. The prior/base distribution of the auxiliary variables is defined such that it is 1 for the root node and for any other node in the tree. When a new node is added to the tree, we sample and update . This ensures that . This change is crucial as it allows us to design a Markov chain that converges to the stationary distribution of . The SSM population frequency for any node can then be computed via , where and are the sets of all descendants and children of node respectively. This construction ensures that the SSM population frequencies of mutations appearing at the parent node is greater than or equal to the sum of the frequencies of all its children. We use the Metropolis-Hastings algorithm [hastings1970monte] to sample from the posterior distribution of the auxiliary variables (2) and derive the SSM population frequencies from these samples by selecting the sampled set of population frequencies with the highest likelihood. We use an asymmetric Dirichlet distribution as the proposal distribution.
6.2 Integrating CNV data into PhyloSub
The focus of our new method, PhyloWGS is integrating SSM frequencies with existing copy number variation (CNV) based subclonal reconstructions. As mentioned above, our algorithm takes as input a set of SSMs along with their allele frequencies, expressed for each SSM , as the number of reads at the locus supporting either the SSM () or the reference allele (). We also allow our algorithm to take a set of inferred copy number changes, where for each change , the input provides the new copy number as well as the proportion of the population with the change . In some cases, we also require the breakdown of into the new number of maternal () and paternal () copies of the locus (see below for details). If this breakdown is not available, we can restrict our attention to CNVs for which because in these cases, there is only one possible breakdown. Also, in the absence of paternal/maternal breakdown, we should still be able to, in theory, assign SSMs with overlapping CNVs with to specific populations once the phylogeny and subclonal populations have been defined using SSMs and CNVs in regions of .
Below, we describe the rules, based on the infinite sites assumption, that we use to determine the relationship between the population frequency of an SSM and its observed variant allele frequency . When the SSM does not overlap a region that has a predicted CNV in any cell in the tumor population, then the predicted allele frequency is simply half of the modeled population frequency. We also describe the method by which we transform each CNV into a pseudo-SSM to be included in the phylogeny.
6.2.1 If CNVs do not overlap with any SSM
If a CNV occurs in a region without any SSMs, we generate a ‘pseudo-SSM’ for the CNV which is represented in the model as a heterozygous, binary somatic mutation with a read depth that reflects the uncertainty in the provided population frequency for the CNV. Specifically, we generate reference and variant read counts, and , respectively, such that the allelic frequency is approximately equal to and the total number of reads is selected based on the evidence supporting the CNV. Generating this pseudo-SSM allows the CNV to be treated like any other SSM in the phylogeny model.
6.2.2 If CNVs overlap with SSMs
If a structural variant occurs in a region with an SSM , this complicates the relationship between the proportion of cells that contain the SSM and the expected number of reads because cells with the CNV will have more (or fewer) than two copies of the locus where the SSM lies. Assuming equal sampling of these regions, the expected proportion of reads without the mutation () is always: where is the number of copies of the locus that have the reference allele and is the number of copies of the locus with the variant allele. To account for sequencing error we define as the probability of reading the reference allele when the locus contains the variant allele and vice-versa. The expected proportion of reads containing the reference allele is then:
Looking at a tumor sample with multiple populations and without structural variations, if each population is present with proportion and where is if population contains the SSM and otherwise, then and . This is equivalent to an algorithm that looks at each population and performs the following update. If the population contains the SSM then
If the population does not contain the SSM then:
To take into account CNVs requires a more complex procedure. For each population, for each SSM, the number of reference and variant alleles depends on the copy number of the locus and, potentially, number of maternal () and paternal () copies of the locus as well as the evolutionary relationship between the SSM and the CNV. The infinite sites assumption does not apply for CNVs, adding a further level of complexity because multiple CNVs at the same locus are possible. For each population, the CNV that affects its contribution to the number of reference and variant genomes can be found by ascending the evolutionary tree towards the root. The first CNV found in this ascent is the CNV relevant for the population. If no CNV is found than the population is not affected by a CNV. For each population there are five possible situations:
The population does not contain the SSM and is not affected by a CNV
The population does not contain the SSM but is affected by a CNV
The population contains the SSM but is not affected by a CNV
The population contains the SSM and is affected by a CNV, and the SSM occurred after the CNV
The population contains the SSM and is affected by a CNV, and the CNV occurred after the SSM
If a population does not contain the SSM, then even if a CN change has occurred (cases i and ii), the update rule is:
If a population contains the SSM and the SSM occurred after a CN change (or there was no CN change) (cases iii and iv) then there is a single copy of the mutated genome and the remainder are reference, so the the update rule is:
If a population contains the SSM and the SSM occurred before the CN change (case v) then there are two possibilities, the SSM is on the maternal copy or the paternal copy. If the SSM is on the maternal copy, the update rule is:
If however, the SSM is on the maternal copy, the update rule is:
Note that the breakdown of into and is only required if the CNV occurs after the SSM on the same branch.
Now that we can calculate and , the observation model for the allelic counts has the following generative process (cf. (1)):
Note that in some circumstances, a SSM can be placed on a particular copy of the chromosome by looking for reads that cover the SSM and nearby heterozygous germline mutations. If this is not possible then the likelihood of is the average of two likelihoods; the likelihood if the SSM occurs on the maternal genome and the likelihood if the SSM occurs on the paternal genome.
6.3 Extension to multiple samples
Our model can be easily extended to multiple tumor samples. We make no assumptions regarding the time that the samples were collected, so this extension is equally applicable to multiple samples collected simultaneously (e.g. as in [Gerlinger12]) or over a period of time as in [Schuh12]. We allow the tree-structured stick-breaking process prior to be shared across all the samples. The main technical difference between the single and the multiple sample models lies in the sampling procedure for SSM population frequencies. In the multiple sample model, we ensure that the clonal evolutionary constraints are satisfied separately for each tumor sample and then make a global Metropolis-Hastings move based on the product of posterior distributions across all the samples (cf. (2)).
6.4 MCMC settings
In all the MCMC experiments, we fix the number of MCMC iterations to 2,500 and use a burn-in of 100 samples. We also fix the number of iterations in the Metropolis-Hastings algorithm to 5,000 and set the scaling factor for the Dirichlet proposal distribution to (see PhyloSub paper [phylosub]). We use the CODA R package [coda] for MCMC diagnostics to monitor the convergence of the samplers using the complete-data log likelihood traces and the corresponding autocorrelation function.
6.5 Sequencing error
It is becoming increasingly clear that sequencing error is not uniform across the genome and different trinucleotide sequences result in different sequencing error rates [boutros2014global]. While the precise nature of these differences is not yet fully known, PhyloWGS allows the user to input a different sequencing error rate for each mutation.