Cancer is a disease of the genome, in which a range of structural genomic changes to the DNA of otherwise healthy cells results in a cancer phenotype. A well-studied subset of these aberrations comprises single nucleotide variants and short insertion/deletion mutations. Identifying these through next-generation sequencing for a given individual is now common in clinical practice to guide targeted and personalized therapies, a field known as precision oncology [Bungartz2018MakingTR]. However, separating these somatic variants from sequencing error poses a somatic/artifact classification problem that is yet to be fully resolved.
This problem has traditionally been approached with Bayesian classifiers [Cibulskis2013SensitiveDO; Saunders2012StrelkaAS], which give reasonable accuracy while providing information on the robustness of the variant call. More recently, there have been attempts to increase the accuracy of variant calls through neural-network-based approaches, such as DeepVariant (germline variants) [Poplin2018AUS] and SomaticNET (somatic variants) [ESMOnet; Harries2018]. However, these do not yet provide output probabilities that better represent the underlying reality of the tumor.
In this study, we aim to build upon our previous work in applying Bayesian Neural Networks (BNNs) to somatic variant calling [DubourgFelonneau2019SafetyAR], an approach which enables the network to learn a posterior probability density function (pdf) over the space of the weights, and in turn provide a measure of uncertainty which makes the model more robust. We build on this by investigating the distribution of output probabilities compared to standard neural networks, when applied to real-world data and in various out-of-distribution scenarios commonly observed in disparate sequencing datasets.
2 Variant Calling
The task, data and labels for this study are consistent with our previous work in this field [DubourgFelonneau2019SafetyAR], as follows:
2.1 The Task
Next-generation DNA sequencing techniques produce overlapping short sequences of DNA called reads. For a fixed position in a sequenced genome, we observe a number of reads originating from multiple cells. We call this number the depth. When, for a given patient, we sequence both tumor and normal tissues for comparison, we expect to see a number of true somatic variants (in the tumor tissue but not in the normal tissue), although the vast majority of the variations observed are due to sequencing noise. The task we undertake here involves the differentiation of somatic variants (tumor) from germline variants (normal) and sequencing error (noise).
2.2 The Data
We developed a bioinformatics pipeline to identify genomic loci where sequencing data from the tumor significantly differs from the normal. We extract the data in the following form:
$d$: the depth
$w$: the width, or number of observed loci
$N_{i,j}$: the normal base at position $i$ and depth $j$
$T_{i,j}$: the tumor base at position $i$ and depth $j$
2.3 The Labels
For training, we used the Multi-Center Mutation Calling in Multiple Cancers (MC3) high-confidence variant dataset [MC3paper]. The labels for this dataset are binary: 1 if the variant is validated positive, 0 if the variant is validated negative. We did not include non-validated data.
The input data comprises the aforementioned matrices, with a specific size of (100, 20) and three color channels. We reshape the input data from (100, 20, 3) to (100, 60), diluting the color information so that it can be analyzed using LSTMs. Secondly, we balanced the dataset by undersampling the majority class, resulting in a total of 103,868 (image, label) pairs. For these data we used a 0.8 train-test split.
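As a minimal sketch of this preprocessing step (variable names are illustrative, not from our pipeline), the reshape from (100, 20, 3) to (100, 60) simply folds the channel axis into the width axis, so each of the 100 rows becomes one 60-dimensional LSTM time step:

```python
import numpy as np

# Illustrative stand-in for one (100, 20) locus matrix with 3 color
# channels; real inputs come from the bioinformatics pipeline.
image = np.random.rand(100, 20, 3)

# Fold the channel axis into the width axis: (100, 20, 3) -> (100, 60).
lstm_input = image.reshape(100, 20 * 3)

assert lstm_input.shape == (100, 60)
```

The reshape is lossless: the original (100, 20, 3) tensor can be recovered exactly, so "diluting" refers only to the channels no longer being a separate axis.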
We start from the Bayesian Neural Network architecture shown in Figure 2, implemented using TensorFlow and the TensorFlow Probability (tfp) library. To test its robustness and performance we compare it with a standard neural network, identical except that the variational DenseFlipout layers are substituted with Dense layers.
These methods build on much of our previous work on a suitable cost function for this type of data and this classification problemDubourgFelonneau2019SafetyAR.
We use variational layers during training with data $\mathcal{D}$ to learn an approximate probability density function $q(w)$ of the true, usually intractable, posterior for the weights $w$ of the network. Formally, this should be done by minimizing the KL divergence between the two. Minimizing this KL divergence is equivalent to minimizing another cost function, known as the negative ELBO (Evidence Lower Bound) [weightuncertainty]:
\[
-\mathrm{ELBO} = \mathrm{KL}\big(q(w)\,\|\,p(w)\big) - \int q(w)\,\log p(\mathcal{D}\mid w)\,dw \tag{1}
\]
where $p(\mathcal{D}\mid w)$ and $p(w)$ are the likelihood of the data and the prior, respectively, for the weights $w$, and $q(w)$ is an approximate pdf to the true posterior.
Practically, as the integrals in (1) are in general analytically intractable, they are computed via Monte Carlo (MC) sampling from the approximate pdf $q(w)$. This is done automatically by the variational layers in tfp.
Secondly, as it is costly to calculate the ELBO for large datasets such as those in DNA sequencing, tfp allows minibatch optimization. Therefore, (1) is calculated as a sum over per-batch ELBOs computed on batches of data $\mathcal{D}_i$ with sampled weights $w_i$, in which the KL term is down-weighted by the number of batches $M$:
\[
-\mathrm{ELBO} \approx \sum_{i=1}^{M} \Big[ \tfrac{1}{M}\,\mathrm{KL}\big(q(w)\,\|\,p(w)\big) - \log p(\mathcal{D}_i\mid w_i) \Big]
\]
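To see why the KL term carries a $1/M$ weight per batch, here is a toy numpy check (with made-up per-example negative log-likelihoods and a fixed KL value standing in for the real terms) that, for fixed weights, the summed minibatch objectives recover the full-dataset objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up quantities for illustration: per-example negative
# log-likelihoods over the whole dataset, and KL(q || p) for fixed weights.
nll = rng.random(1200)   # -log p(x_i | w) for each example
kl = 3.7                 # KL(q(w) || p(w))
M = 8                    # number of minibatches

full_objective = kl + nll.sum()

# Each minibatch objective carries 1/M of the KL term, so summing the
# M batch objectives reproduces the full objective exactly.
batches = np.array_split(nll, M)
minibatch_sum = sum(kl / M + b.sum() for b in batches)

assert np.isclose(full_objective, minibatch_sum)
```

In practice the weights are re-sampled per batch, which is what makes the estimator stochastic but still unbiased in expectation.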
This approximation is an unbiased stochastic estimator of the ELBO [weightuncertainty]. As a result, the gradient is not systematically miscalculated, allowing convergence to the correct solution.
The Flipout estimator [flipout] provides an unbiased ELBO estimate with lower variance than other methods (although at a larger computational cost). This is valuable, since the optimization of the ELBO depends largely on its variance.
Once the pdf $q(w)$ is fitted by minimizing the negative ELBO, one can infer the label $y^*$ of a test datum $x^*$ by computing:
\[
p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, w)\, q(w)\, dw
\]
In practice, this is done by Monte Carlo sampling: we substitute the integral with a sum over weights sampled from the approximate posterior, then take the average.
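A minimal numpy sketch of this MC estimate, where the sampled forward pass below is an illustrative stand-in for evaluating the network with weights drawn from $q(w)$:

```python
import numpy as np

rng = np.random.default_rng(42)

def sampled_forward_pass(x):
    # Stand-in for one forward pass with weights w ~ q(w): random weights
    # followed by a softmax over the two outcomes (somatic vs. artifact).
    logits = x @ rng.normal(size=(x.shape[-1], 2))
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x_test = rng.normal(size=(5, 60))   # 5 illustrative test inputs
n_samples = 100

# p(y | x*, D) ~= (1/S) sum_s p(y | x*, w_s),  with w_s ~ q(w)
predictive = np.mean(
    [sampled_forward_pass(x_test) for _ in range(n_samples)], axis=0
)

# Averaging per-sample probability vectors keeps them normalized.
assert np.allclose(predictive.sum(axis=-1), 1.0)
```

Because each MC sample is a valid probability vector, the averaged estimate automatically sums to one over the outcome set.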
In a classification problem, this means that for each possible outcome $y \in \mathcal{Y}$, where $\mathcal{Y}$ is the set of all outcomes, we calculate this estimate $p(y \mid x^*, \mathcal{D})$, ensuring that $\sum_{y\in\mathcal{Y}} p(y \mid x^*, \mathcal{D}) = 1$.
As we will show later, a desirable property of this implementation of a BNN is that it becomes more uncertain on out-of-distribution samples with respect to the standard network.
It is known that standard neural networks, in particular very deep ones, suffer from miscalibration of the probabilities derived from the logits. One way to partially overcome this is temperature scaling [temp].
We applied temperature scaling [temp] to the standard network in order to make the comparison fairer. Temperature scaling consists of minimizing the cross-entropy, after the network is trained, by fitting a new scalar parameter $T > 0$ on validation data, with the calibrated probabilities obtained by rescaling the logits $z_i$ before the softmax $\sigma_{\mathrm{SM}}$:
\[
\hat{q}_i = \sigma_{\mathrm{SM}}\!\left(z_i / T\right)
\]
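As an illustrative sketch (a simple grid search rather than the gradient-based fit typically used, on made-up validation logits), temperature scaling picks the single scalar $T$ that minimizes the validation cross-entropy:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def nll(logits, labels, T):
    # Cross-entropy of the temperature-scaled probabilities.
    probs = softmax(logits / T)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# Made-up, over-confident validation logits and binary labels.
val_logits = rng.normal(scale=5.0, size=(500, 2))
val_labels = rng.integers(0, 2, size=500)

# Fit T by grid search over a plausible range; T does not change the
# argmax prediction, only the confidence.
grid = np.linspace(0.1, 10.0, 200)
T = grid[np.argmin([nll(val_logits, val_labels, t) for t in grid])]

calibrated = softmax(val_logits / T)
```

Since dividing the logits by a positive scalar preserves their ordering, accuracy is unchanged; only the confidence distribution moves.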
In order to evaluate the calibration of a network we use the Expected Calibration Error (ECE) [temp], the results of which are displayed in Figure 3. This is done by sorting the predictions by confidence and partitioning them into $M$ bins. We compute, per bin, the absolute difference between the accuracy and the confidence, weighted by the bin probability:
\[
\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}\,\big|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\big|
\]
where $B_m$ is the $m$-th bin, $n$ the total number of examples, $\mathrm{acc}(B_m) = \frac{1}{|B_m|}\sum_{i\in B_m}\mathbf{1}(\hat{y}_i = y_i)$ and $\mathrm{conf}(B_m) = \frac{1}{|B_m|}\sum_{i\in B_m}\hat{p}_i$, with $y_i$, $\hat{y}_i$ and $\hat{p}_i$ the label, prediction and confidence for example $i$ in the bin.
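A minimal numpy implementation of this ECE computation (bin edges and the toy inputs are illustrative):

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """Bin predictions by confidence, then sum the per-bin
    |accuracy - confidence| gaps weighted by the bin probability |B_m| / n."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(labels)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = np.mean(predictions[in_bin] == labels[in_bin])
            conf = np.mean(confidences[in_bin])
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece

# Toy perfectly calibrated case: confidence 1.0 and always correct -> ECE 0.
conf = np.ones(4)
pred = np.array([0, 1, 0, 1])
lab = np.array([0, 1, 0, 1])
assert expected_calibration_error(conf, pred, lab) == 0.0
```

A fully confident but always-wrong classifier would instead score an ECE of 1, the worst case.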
The two models achieve similar accuracy (variational: 80%; dense: 81%), although we observe a large difference in their output distributions. The variational network outputs greater density around 0.5 than the standard network. A similar output is achieved with temperature scaling (Figure 4).
Intuitively, if the distribution of the testing data is too different from the distribution of the training data (e.g. batch variation, such as samples with a different distribution of sequencing error), we expect the confidence of the model to decrease accordingly. Such behavior is not observed in common neural networks. To show this, we generated out-of-distribution testing data by adding Gaussian noise to the input. We observe (Figure 5) that the standard neural network is over-confident, whereas the variational approach shows a greater tendency to reduce the overall scatter between positive and negative predictions than the classic approach.
Similarly, differing sequencing depth is another commonly occurring example of sample variability. If the network is used in a clinical setting on data with significantly lower depth than the training data, it is important for the confidence to be proportionally lowered. This is reflected in the variational model (Figure 6), with modes closer to 0.5 and a smaller variance than the standard neural network.
Overall, these results suggest that the variational model is better suited to disparate and highly variable genomic sequencing datasets, providing output probabilities that reflect the test data well.
We have shown in this paper that, for the application of somatic variant calling in cancer, not only can BNNs be utilized instead of standard neural networks without significant performance degradation, but they also yield more representative probability values for the calls. Specifically, in regions where data is sparse or out-of-distribution, the network becomes uncertain and reflects this well in the output scatter. This does not occur in standard NNs, where the network outputs values of high confidence regardless of out-of-distribution scenarios. We aim to deploy these models to enable flexible but robust molecular profiling of tumors, for superior application of targeted therapies in precision oncology.