Maximum n-times Coverage for COVID-19 Vaccine Design

01/24/2021 ∙ by Ge Liu, et al. ∙ MIT 0

In the maximum n-times coverage problem, we are provided a set of elements, a weight for each element, and a set of overlays where each overlay specifies an element specific coverage of zero or more times. The goal is to select up to k overlays such that the sum of the weights of elements that are covered at least n times is maximized. We also define the min-cost n-times coverage problem where the objective is to select the minimum set of overlays such that the sum of the weights of elements that are covered at least n times is at least τ. We show that the n-times coverage objective is not submodular, and we present an efficient solution by sequential greedy optimization. We frame the design of a peptide vaccine for COVID-19 as maximum n-times coverage using machine learning defined candidate peptide sets, and show that our solution is superior to 29 other published COVID-19 peptide vaccine designs in predicted population coverage and the expected number of peptides displayed by each individual's HLA molecules.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

Introduction

The COVID-19 pandemic has caused an unprecedented challenge for healthcare systems worldwide, and developing a vaccine that produces durable immunity with high population coverage is of upmost importance. A peptide vaccine uses a set of peptides to elicit a protective cellular immune response (Ott et al., 2017; Li et al., 2014). For a peptide to be effective in a vaccine it must be presented by an individual’s Major Histocompatibility Complex (MHC) molecules and be immunogenic. A displayed peptide is immunogenic when it activates T cells, which expand in number and mount a response against pathogens or tumor cells. Memory T cells provide robust immunity against COVID-19, even in the absence of detectable antibodies (Sekine et al., 2020). The use of peptide vaccine components to activate T cells is in development for cancer (Hu et al., 2018) and viral diseases including HIV (Arunachalam et al., 2020), HPV (Kenter et al., 2009) and malaria (Li et al., 2014; Rosendahl Huber et al., 2014).

A challenge for the design of peptide vaccines is the diversity of human MHC alleles that each have specific preferences for the peptide sequences they will display. The Human Leukocyte Antigen (HLA) loci, located within the MHC, encode the HLA class I and class II molecules. We consider the three classical class I loci (HLA-A, HLA-B, and HLA-C) and three loci that encode class II molecules (HLA-DR, HLA-DQ, and HLA-DP). An individual’s HLA type describes the alleles they carry at each of these loci. Peptides of length 8–10 residues can bind to HLA class I molecules whereas those of length 13–25 bind to HLA class II molecules (Rist et al., 2013; Chicz et al., 1992). To create effective vaccines it is necessary to consider the HLA allelic frequency in the population, as well as linkage disequilibrium between HLA genes to discover a set of peptides that is likely to be robustly displayed.

For a peptide vaccine to be effective it needs to invoke a robust cellular immune response with a diverse set of peptides (Schultheiß et al., 2020; Grifoni et al., 2020a), which motivates our use of -times coverage. Redundant coverage is also important to compensate for our imperfect ability to predict what peptides will be immunogenic in a given individual. We adopt an experimentally calibrated model of peptide-HLA immunogenicity to create a set of candidates and perform population coverage optimization to select a compact set of peptides that provides diverse display of peptides in each individual. Our vaccine designs and evaluations are based upon the observed immunogenicity of peptides in convalescent COVID-19 patient peripheral blood mononuclear cell samples (Snyder et al., 2020; Klinger et al., 2015; Nolan et al., 2020), and by ensembling machine learning predictions for peptides or HLA alleles that are not observed in the clinical data Liu et al. (2021).

We formalize the number of times that an individual displays peptides as a coverage problem, and thus we generalize single coverage problems to -times coverage problems. We show that framing vaccine design as maximum -times coverage produces a solution that produces superior predicted population coverage when compared to 29 previous published vaccines for COVID-19 with less than 150 peptides. We also show that maximum -times coverage is not submodular, and introduce MarginalGreedy, an efficient algorithm for solving -times coverage problem on both synthetic data and real vaccine design.

In the maximum -times coverage problem, we are provided a set of elements, a weight for each element, and a set of overlays where each overlay specifies an element specific coverage of zero or more times. The goal is to select up to overlays such that the sum of the weights of elements that are covered at least times is maximized. In this framing, an element is a specific collection of HLA alleles (a haplotype), weights are the frequency of haplotypes in the population, is the desired number of peptides displayed by each individual, and an overlay is a peptide that is predicted to be displayed by each haplotype a specified number of times. The solution of the maximum -time coverage problem allows us to find a set of overlays (peptides) that maximizes the sum of weights (population coverage).

Our introduction of weighted elements and required -times coverage creates a new class of problem without a known solution or complexity bound. The closest problem to the maximum -times coverage problem is the multi-set multi-cover problem which does not assign weights to the elements and thus can be formulated as the Covering Integer Program (CIP) problem (Srinivasan, 1999). A -time approximation algorithm for CIP can violate coverage constraints Dobson (1982); Kolliopoulos (2003); Kolliopoulos and Young (2001). Other existing robust coverage problems do not address the need for explicit -times coverage of elements to ensure each individual is predicted to display multiple peptides. Deletion-robust submodular maximization protects against adversarial deletion (Bogunovic et al., 2018; Mirzasoleiman et al., 2017) and robust submodular optimization protects against objective function uncertainty (Iyer, 2019), but neither guarantees that each element is covered times.

The Maximum -times coverage problem

Set Cover and Maximum Coverage

In the standard Set Cover and Maximum Coverage problems, we are given a set of elements (also known as the universe) and a collection of subsets of such that . The goal in the Set Cover problem is to select a minimal-cardinality set of subsets from such that their union covers . In the Maximum Coverage problem, our goal is to select subsets from such that the union of these subsets contains maximum number of elements. Both problems are known to be NP-hard Feige (1998), but can be solved with a greedy approximation algorithm with the best guaranteed approximation ratio.

A generalization of Set Cover is to use any submodular set function as the objective instead of the cardinality function. Given a set function that assign each subset a value where is a finite set, we define the discrete derivative (or marginal gain) as for . A set function is submodular iff it satisfies the following diminishing return property:

Definition 1.

(Submodularity) A function is submodular if for every and it holds that .

Equivalently, is a submodular function if for every .

We can then define the Submodular Set Cover problem as the problem of selecting a minimal-cardinality subset such that , where is a monotone submodular function and is the ground set of items. A similar problem is the Min-cost Coverage problem:

The Submodular Maximum Coverage problem is defined as selecting a subset that maximizes with cardinality constraint , where is monotone and submodular.

Given is monotone submodular, a greedy algorithm is known to provide approximation for Submodular Maximum Coverage and approximation for Submodular Set Cover (Nemhauser et al., 1978).

Multi-set Multi-cover problem

The Multi-set Multi-cover (MSMC) problem is a generalization of Set Cover, where multi-sets are special type of sets in which an element can appear in for more than once. The objective of the MSMC problem is to determine the minimum number of multisets (a multi-set can be chosen multiple times) such that each element is covered at least times. It can be formulated into Covering Integer Program (CIP) Srinivasan (1999) problem:

Definition 2.

(Covering Integer Program, CIP) Given , a CIP seeks to minimize , subject to ,and .

Here represents the number of times -th element appears in the -th multi-set. The is set to be all 1 in MSMC. The constraints

are called multiplicity constraints which limit the number of times a single multi-set can be reused, and they generally make covering problems much harder as natural linear programming (LP) relaxation can have an unbounded integrality gap 

(Chuzhoy and Naor, 2006). Dobson (1982) provides a combinatorial greedy -approximation algorithm ( stands for -th harmonic number) but multiplicity constraints can be dealt with effectively only in the (0,1) case thus this algorithm can be as bad as polynomial. Kolliopoulos and Young (2001); Kolliopoulos (2003) gave a tighter-bound solution that can obtain -approximation.

Figure 1: Example of -times coverage calculation.

Maximum -times Coverage

We introduce the maximum -times coverage problem, a variant of the MSMC problem that accounts for multiple coverage of each element while also assigning weights to different elements. We are given a set with elements each associated with a non-negative weight , and a set of overlays . Each overlay covers each element in an element specific number of times , which is similar to a multi-set. When the element is not covered by overlay . We use a very strict multiplicity constraint such that each overlay can be used only once. Given a subset of overlays , the total number of times an element is covered by is the sum of for each overlay in :

(1)

We define the -times coverage function as the sum of weights of elements in that are covered at least times by . Figure 1 shows an example computation of the -times coverage function.

(2)

The objective of the Maximum -times Coverage problem is to select overlays such that is maximized. This can be formulated as the maximization of the monotone set function under cardinality constraint ,

(3)

We define Min-cost -times Coverage as the minimum set of overlays such that the sum of the weights of elements covered at least times is . We assume provides sufficient -times coverage for . We define the -times Set Cover problem as the special case of Min-cost -times Coverage where .

(4)
(5)
Theorem 1.

The -times coverage function is not submodular.

Proof.

We show is not submodular for (a similar counter example can be found for any ). Consider the example overlays in Table 1. When , none of ,, or provides -times coverage while covers two times and covers two times. Therefore, the marginal gain of adding , , or into an empty set is always zero, whereas adding into or achieves non-zero gains and adding into achieves even higher gain:

Given that and , the function does not satisfy the diminishing return property (Definition 1) and thus is not submodular. ∎

0.01 0.01 0.5 0.48
0 1 0 1
0 0 1 1
1 1 0 0
Table 1: Coverage map of overlays used in the counter example

Since the -times coverage problem is not submodular, we cannot take advantage of the proven near-optimal performance of the greedy algorithm for the Submodular Maximum Coverage problem. Thus we seek new solutions.

Vaccine population coverage maximization

Peptide vaccine design can be framed as an instance of the maximum -time coverage problem. Peptides must immunogenic to be effective in a vaccine, and we use an integrated model of peptide-HLA immunogenicity (Supplemental Data). We define a peptide-HLA hit as a (peptide, HLA allele) pair where the peptide is predicted to immunogenic and displayed the HLA allele. Once we have determined a candidate set of peptides that are predicted to be immunogenic we then need to select a minimal subset of peptides such that each individual in of the population is predicted to have peptide-HLA hits.

We first introduce EvalVax-Robust

, an evaluation tool for estimating the population coverage of a proposed peptide vaccine set.

EvalVax-Robust evaluates the percentage of the population having at least peptide-HLA binding hits in each individual. It takes into consideration allelic zygosity, and utilizes HLA haplotype frequencies for MHC class I (HLA-A/B/C) and MHC class II (HLA-DP/DQ/DR) genes to account for linkage disequilibrium (LD) among loci.

The MHC class I genes are encoded by the HLA-A, HLA-B, and HLA-C loci and a single chromosome contains one allele at each locus that is described by , , and respectively. The haplotype frequency of is denoted by and . Each individual inherits two haplotypes to form their diploid genotype. The the frequency of a diploid genotype is thus:

(6)

For a given peptide and an HLA allele , our machine learning model outputs a binary hit prediction indicating peptide-HLA immunogenicity. For each diploid genotype we compute the total number of peptide-HLA hits as the sum of over the unique HLA alleles in the genotype. There can be 3-6 unique alleles depending on the zygosity of each locus:

(7)

The total number of peptide-HLA hits provided by a set of peptides is the sum of number of hits per peptide:

(8)

We define the EvalVax-Robust predicted population coverage with peptide-HLA hits for a peptide vaccine set as:

(9)

EvalVax-Robust optimization can be accomplished by a solution to maximum -times coverage. corresponds to the possible genotypes, and the weights are population genotype frequencies . The peptide candidate set is the set of possible overlays , where each peptide is an overlay which covers a genotype times.

The MarginalGreedy algorithm for maximum -times coverage

We observed that greedy solutions to the maximum -times coverage problem are problematic when the -times objective is directly approached. The greedy approach can fail if it is unable to find early overlays with sufficient -times coverage, and can fail later because of early bad overlay choices. Both problems result in subsequent overlay choices not providing sufficient marginal gain to avoid ties and the random selection of subsequent overlays. Thus we needed a solution that did not try to directly solve for -times coverage at the outset, but rather marginally approached it.

Input: Weights of the elements in :, ground set of overlays where each overlay in covers for times, target minimum # times being covered , coverage cutoff for different : , beam size
Initialize beam ,
for  do
     Using set function as objective function.
     repeat
         solution candidate set
         for  do Beam search
              for  do
                                          
          {Top candidate overlay subsets in scored by }
         
          median score of candidate overlay sets in the beam
     until  (in -th cycle the termination condition is )
     (for Maximum -times Coverage with cardinality constraint , there is additional
     termination condition )
     
Output: The final selected subset of overlays
Algorithm 1 MarginalGreedy algorithm (for Min-cost -times Coverage)
Figure 2: The MarginalGreedy algorithm outperforms the greedy algorithm on both LARGE and SMALL toy examples. Superior performance is seen on both the -times Set Cover where a smaller number of overlays is used by MarginalGreedy to achieve 100% coverage at different criteria of , and on the Maximum -times Coverage problems where MarginalGreedy

achieves higher coverage given same number of overlays. Shaded regions indicate 95% confidence intervals.

We propose MarginalGreedy, an efficient algorithm that optimizes -times coverage with a sequence of greedy optimization cycles where the -th cycle optimizes the coverage function . We establish coverage starting at and increase the criteria to . Thus early overlay selections are guided by less stringent cycle-specific coverage objectives. A set of coverage cutoffs is used as the termination condition for each greedy optimization cycle, and when not specified, we assume by default. We use beam search to keep track of top candidate solutions at each iteration.

The full algorithm is given in Algorithm 1. A similar algorithm can be used to solve the -times Set Cover problem in which . For the Maximum -times Coverage problem with cardinality constraint , the optimization terminates when . When beam search is not used to reduce computation time (), we have extended the algorithm to break ties during the -times coverage iteration by looking ahead to -times coverage. We call this extension look ahead tie-breaking.

Another advantage of MarginalGreedy is that it is capable of guaranteeing high coverage for by controlling the cycle-specific coverage cutoffs . This is desired in vaccine design where wider population coverage is also important and we want to make sure that almost 100% of the population will be covered at least once.

Results

MarginalGreedy outperforms a greedy baseline on toy examples

We empirically evaluate our MarginalGreedy algorithm with two toy examples: a LARGE dataset where a set of 30 overlays are randomly generated to cover a set of 10 elements (with equal weights) 0, 1, or 2 times, and a SMALL dataset where a set of 10 overlays that were randomly generated to cover a set of 5 elements with equal weights 0, 1, or 2 times. Figure 2 shows the efficiency of the MarginalGreedy algorithm on both the -times Set Cover and Maximum -times Coverage problems for varying values of . We compare our MarginalGreedy algorithm to a greedy algorithm that directly optimizes -times coverage and find the greedy algorithm degenerates to sub-optimal solutions for large values of . These sub-optimal solutions can be as bad as random. We repeated problem generation and optimization with 6 different random seeds to provide confidence bounds and used beam size with look-ahead tie-breaking. As shown in Figure 2, MarginalGreedy outperforms greedy on all tasks and datasets. We observed that MarginalGreedy has a significant advantage over greedy in tests with larger and in regions of higher coverage. We also compared the performance of polynomial-time MarginalGreedy to an exponential-time exhaustive search procedure that discovers the optimal set of overlays on smaller problems. We found MarginalGreedy produces solutions close to optimal.

Figure 3: EvalVax population coverage evaluation of SARS-CoV-2 vaccines for (A) MHC class I and (B) MHC class II. (a) EvalVax-Robust population coverage with peptide-HLA hits per individual, OptiVax-Robust performance is shown by the blue curve and baseline performance is shown by red crosses (labeled by name of first author). (b) EvalVax-Robust population coverage with peptide-HLA hits. (c) EvalVax-Robust population coverage with peptide-HLA hits. (d) Comparison of OptiVax-Robust and baselines on expected number of peptide-HLA hits.

COVID-19 vaccine design using maximum -times coverage

We used our MarginalGreedy algorithm to design a peptide vaccine for COVID-19 using the EvalVax-Robust objective described in (9). We obtained the SARS-CoV-2 viral proteome from the first documented case from GISAID (Elbe and Buckland-Merrett, 2017) (sequence entry Wuhan/IPBCAMS-WH-01/2019). We used Nextstrain (Hadfield et al., 2018) to identify open reading frames (ORFs) and translate the sequence. We applied sliding windows of length 8–10 (MHC class I) and 13–25 (MHC class II) to identify candidate peptides for inclusion in our peptide vaccine, resulting in 29,403 candidate peptides for MHC class I and 125,593 candidate peptides for MHC class II. We exclude peptides that may be glycosylated Gupta et al. (2004); Wolfert and Boons (2013), cross known cleavage sites Wang et al. (2020); Consortium (2019), overlap with human self-peptides Consortium (2019), or have a mutation rate (as calculated over 4,690 geographically sampled viral genomes) Elbe and Buckland-Merrett (2017); Hadfield et al. (2018) (details in Supplement). We use observed haplotype frequencies of 230 different HLA-A, HLA-B, and HLA-C loci for class I computations and observed haplotype frequencies of 280 different HLA-DP, HLA-DQ, and HLA-DR for class II computations (Liu et al., 2020). This dataset contains independent haplotype frequency measurements for three populations self-reporting as having White, Black, or Asian ancestry. We score peptide-HLA immunogenicity based upon clinical data from convalescent COVID-19 patients Liu et al. (2021). For peptide-HLA combinations not observed in our clinical data we selected a machine learning model of immunogenicity that best predicted the peptide-HLA combinations we did observe. Our selected MHC class I model predicts a peptide-HLA combination will be immunogenic if they are predicted to bind with an affinity of 50nM by the mean of the predictions from PUFFIN (Zeng and Gifford, 2019), NetMHCpan-4.0 (Jurtz et al., 2017) and MHCflurry 2.0 (O’Donnell et al., 2020). Our selected MHC class II model predicts a peptide-HLA combination will be immunogenic if they are predicted to bind with an affinity of 50 nM by NetMHCIIpan-4.0.

We implemented the OptiVax-Robust method for peptide vaccine design using the MarginalGreedy algorithm, but modified MarginalGreedy to reduce vaccine redundancy by eliminating unselected peptides that are within three (MHC class I) or five (MHC class II) edits on a sequence distance metric from the selected peptides at each iteration. We used a beam size of 10 for MHC class I and 5 for MHC class II.

Contemporary peptide vaccine design methods use machine learning scoring of peptide display for specific HLA alleles followed by the manual selection of peptide sets or a greedy search that maximizes coverage at =1. These methods do not utilize the distribution of HLA haplotypes in a population and thus can not accurately assess the population coverage provided by a vaccine. For a selected set of peptides, the IEDB Population Coverage Tool Bui et al. (2006) estimates peptide-MHC binding coverage and the distribution of peptides displayed for a given population but assumes independence between different loci and thus does not consider linkage disequilibrium. However, the IEDB tool does not provide automated peptide selection as we describe.

Figure 4: SARS-CoV-2 OptiVax-Robust selected peptide vaccine set for (A) MHC class I and (B) MHC class II. (a) EvalVax-Robust population coverage at different per-individual number of peptide-HLA hit cutoffs for populations self-reporting as having White, Black, or Asian ancestry and average values. (b) Binding of vaccine peptides to each of the available alleles. (c) Distribution of the peptide origin. (d) Distribution of the number of per-individual peptide-HLA hits in populations self-reporting as having White, Black, or Asian ancestry. (e) Fraction of vaccine peptides that are also peptides of SARS-CoV.

We compared our vaccine designs with 29 vaccine designs in the literature on the probability that a vaccine has at least

peptide-HLA hits per individual in a population, and the expected number of per individual peptide-HLA hits in the population, which provides insight on how well a vaccine is displayed on average (Lee and Koohy, 2020; Fast et al., 2020; Poran et al., 2020; Bhattacharya et al., 2020; Baruah and Bose, 2020; Abdelmageed et al., 2020; Ahmed et al., 2020; Srivastava et al., 2020; Herst et al., 2020; Vashi et al., 2020; Akhand et al., 2020; Mitra et al., 2020; Khan et al., 2020; Banerjee et al., 2020; Ramaiah and Arumugaswami, 2020; Gupta et al., 2020; Saha and Prasad, 2020; Tahir ul Qamar et al., 2020; Singh et al., 2020). Figure 3

shows the comparison between OptiVax-Robust (A) MHC class I and (B) class II vaccine designs at all vaccine sizes (top solution in the beam up to the given vaccine size) from 1-45 peptides (blue curves) and baseline vaccines (red crosses) from prior work. We observe superior performance of OptiVax-Robust vaccine designs on all evaluation metrics at all vaccine sizes for both MHC class I and class II. Most baselines achieve reasonable coverage at

peptide hits. However, many fail to show a high probability of higher hit counts, indicating a lack of predicted redundancy if a single peptide is not displayed. Note that OptiVax-Robust also outperforms all baselines on coverage and achieve 99.99% (MHC class I) and 99.67% (MHC class II) coverage for , suggesting its capability to cover almost full population at least once while optimizing for higher peptide display diversity. We also found that OptiVax-Robust outperforms all baselines with an immunogenicity model that does not incorporate clinical data (Supplement Figure S1).

MHC class I results.

We selected an optimized set of peptides from all SARS-CoV-2 proteins using OptiVax. We use our MHC class I integrated model of peptide-HLA immunogenicity for our objective function (Supplemental). After all of our filtering steps we considered 1546 candidate peptides. With OptiVax-Robust optimization, we design a vaccine with 18 peptides that achieves 99.999% EvalVax-Robust coverage over populations self-reporting as having Asian, Black, or White ancestry with at least one peptide-HLA hit per individual. This set of peptides also provides 99.53% coverage with at least 5 peptide-HLA hits and 92.32% coverage with at least 8 peptide-HLA hits (Figure 4A, Table S1). The population level distribution of the number of peptide-HLA hits in White, Black, and Asian populations is shown in Figure 4A, where the expected number of peptide-HLA hits 12.532, 11.425 and 13.326, respectively.

MHC class II results.

We use our MHC class II model of peptide-HLA immunogenicity for our objective function (Supplemental). After all of our filtering steps we had 8003 candidate peptides. With OptiVax-Robust optimization, we design a vaccine with 18 peptides that achieves 99.67% EvalVax-Robust coverage with at least one peptide-HLA hit per individual. This set of peptides also provides 97.22% coverage with at least 5 peptide-HLA hits and 87.99% coverage with at least 8 peptide-HLA hits (Figure 4B, Table S1). The population level distribution of the number of peptide-HLA hits per individual in populations self-reporting as having Asian, Black, or White ancestry is shown in Figure 4B, where the expected number of peptide-HLA hits is 17.206, 16.085 and 11.210, respectively.

Conclusion

We introduced the maximum -times coverage problem, and showed that it is not submodular. We presented a beam search algorithm for solving the problem, and used it to produce a peptide vaccine design for COVID-19. We compared the resulting optimized peptide vaccine design with 29 other published designs and found that the optimized design provides substantially better population coverage for both MHC class I and class II presentation of viral peptides. The use of -times coverage as an objective increases vaccine redundancy and thus the probability that one of the presented peptides will be immunogenic and produce a T cell response. Our methods can also be used to augment existing vaccine designs with peptides to improve their predicted population coverage by initializing the algorithm with peptides that are present in an existing design.

Ethics Statement

This work presents a principled combinatorial design approach for developing peptide vaccines using machine learning methods, and we apply it to develop a novel MHC class I and II vaccine design for COVID-19. On our metrics, the resulting design is predicted to provide better population coverage than 29 existing vaccine designs with less than 150 peptides. More broadly, the EvalVax and OptiVax methods we introduce may be used to evaluate and design peptide vaccines against viral infections or cancer to achieve greater population coverage and robust peptide display per individual over existing methods. The general formulation of the maximum -times coverage problem may find use in other fields, such as economics. One potential consequence of this work is identifying vaccines that do not provide adequate population coverage that could lead to costly failures of vaccine trials. We aim to mitigate this risk via our principled vaccine formulation objective and conservative approach for defining the candidate peptide set.

Acknowledgements

This work was supported in part by Schmidt Futures, a Google Cloud Platform grant, NIH grant R01CA218094, and a C3.AI DTI Award to D.K.G. Ge Liu’s contribution was made prior to joining Amazon.

Code Availability

An open source implementation of the methods described can be found at

https://github.com/gifford-lab/optivax.

Author Contributions

G.L., B.C., and D.G. contributed to problem definition and solution. G.L. designed and implemented the optimization procedure, with advice from B.C. and D.G. G.L., B.C., and D.G. wrote the paper.

Declaration of Interests

Brandon Carter is an employee of Think Therapeutics, and Ge Liu is an employee of Amazon. David Gifford and Brandon Carter have a financial interest in Think Therapeutics, Inc.

References

Details of Candidate Peptide Filtering

For our SARS-CoV-2 peptide vaccine design, we eliminate peptides that are expected to mutate and thus cause vaccine escape, peptides crossing cleavage sites, peptides that may be glycosylated, and peptides that are identical to peptides in the human proteome.

Removal of mutable peptides.

We eliminate peptides that are observed to mutate above an input threshold rate (0.001) to improve coverage over all SARS-CoV-2 variants and reduce the chance that the virus will mutate and escape vaccine-induced immunity in the future. When possible, we select peptides that are observed to be perfectly conserved across all observed SARS-CoV-2 viral genomes. Peptides that are observed to be perfectly conserved in thousands of examples may be functionally constrained to evolve slowly or not at all.

For SARS-CoV-2, we obtained the most up to date version of the GISAID database (Elbe and Buckland-Merrett, 2017) (as of 2:02pm EST May 13, 2020, acknowledgements in Supplementary table) and used Nextstrain (Hadfield et al., 2018)111from GitHub commit 639c63f25e0bf30c900f8d3d937de4063d96f791 to remove genomes with sequencing errors, translate the genome into proteins, and perform multiple sequence alignments (MSAs). We retrieved 24468 sequences from GISAID, and 19288 remained after Nextstrain quality processing. After quality processing, Nextstrain randomly sampled 34 genomes from every geographic region and month to produce a representative set of 5142 genomes for evolutionary analysis. Nextstrain definition of a “region” can vary from a city (e.g., “Shanghai”) to a larger geographical district. Spatial and temporal sampling in Nextstrain is designed to provide a representative sampling of sequences around the world.

The 5142 genomes sampled by Nextstrain were then translated into protein sequences and aligned. We eliminated viral genome sequences that had a stop codon, a gap, an unknown amino acid (because of an uncalled nucleotide in the codon), or had a gene that lacked a starting methionine, except for ORF1b which does not begin with a methionine. This left a total of 4690 sequences that were used to compute peptide level mutation probabilities. For each peptide, the probability of mutation was computed as the number of non-reference peptide sequences observed divided by the total number of peptide sequences observed.

Removal of cleavage regions.

SARS-CoV-2 contains a number of post-translation cleavage sites in ORF1a and ORF1b that result in a number of nonstructural protein products. Cleavage sites for ORF1a and ORF1b were obtained from UniProt (Consortium, 2019) under entry P0DTD1. In addition, a furin-like cleavage site has been identified in the Spike protein (Wang et al., 2020; Coutard et al., 2020). This cleavage occurs before peptides are loaded in the endoplasmic reticulum for class I or endosomes for class II. Any peptide that spans any of these cleavage sites is removed from consideration. This removes 3,739 peptides out of the 154,996 we consider across windows 8-10 (class I) and 13-25 (class II) (2.4%).

Removal of glycosylated peptides.

Glycosylation is a post-translational modification that involves the covalent attachment of carbohydrates to specific motifs on the surface of the protein. We eliminate all peptides that are predicted to have N-linked glycosylation as it can inhibit MHC class I peptide loading and T cell recognition of peptides (Wolfert and Boons, 2013). We identified peptides that may be glycosylated with the NetNGlyc N-glycosylation prediction server (Gupta et al., 2004) and eliminated peptides where NetNGlyc predicted a non-zero N-glycosylation probability in any residue. This resulted in the elimination of 18,957 of the 154,996 peptides considered (12%).

Self-epitope removal.

T cells are selected to ignore peptides derived from the normal human proteome, and thus we remove any self peptides from consideration for a vaccine. In addition, it is possible that a vaccine might stimulate the adaptive immune system to react to a self peptide that was presented at an abnormally high level, which could lead to an autoimmune disorder. All peptides from SARS-CoV-2 were scanned against the entire human proteome downloaded from UniProt (Consortium, 2019) under Proteome ID UP000005640. A total of 48 exact peptide matches (46 8-mers, two 9-mers) were discovered and eliminated from consideration.

Detailed OptiVax-Robust Vaccine Design Results

Table S1 shows the evaluation of our OptiVax-Robust vaccine designs using the MarginalGreedy algorithm compared to 29 designs by others as baselines (Lee and Koohy, 2020; Fast et al., 2020; Poran et al., 2020; Bhattacharya et al., 2020; Baruah and Bose, 2020; Abdelmageed et al., 2020; Ahmed et al., 2020; Srivastava et al., 2020; Herst et al., 2020; Vashi et al., 2020; Akhand et al., 2020; Mitra et al., 2020; Khan et al., 2020; Banerjee et al., 2020; Ramaiah and Arumugaswami, 2020; Gupta et al., 2020; Saha and Prasad, 2020; Tahir ul Qamar et al., 2020; Singh et al., 2020). Additionally, we compare to baseline vaccines containing the full S and S1 proteins (Grifoni et al., 2020b), OptiVax-Robust designed vaccines with peptides stemming from the S, M, and N proteins only (Grifoni et al., 2020a), and random subsets of peptides that are binders (averaged over 200 runs, each consisting of the same number of peptides as the OptiVax-Robust design).

Peptide Set Vaccine Size EvalVax-Robust EvalVax-Robust EvalVax-Robust Exp. # peptide-HLA hits / vaccine size Exp. # peptide-HLA hits (White) Exp. # peptide-HLA hits (Black) Exp. # peptide-HLA hits (Asian)
MHC Class I Peptide Vaccine Evaluation
OptiVax-Robust (Ours) % % %
S-protein % % %
S1-subunit % % %
Srivastava et al. (2020) % % %
Herst et al. (2020) % % %
Random subset of binders % % %
Herst et al. (2020)-top16 % % %
Baruah and Bose (2020) % % %
Fast et al. (2020) % % %
Vashi et al. (2020) % % %
Abdelmageed et al. (2020) % % %
Lee and Koohy (2020) % % %
Ahmed et al. (2020) % % %
Poran et al. (2020) % % %
Singh et al. (2020) % % %
Akhand et al. (2020) % % %
Bhattacharya et al. (2020) % % %
Khan et al. (2020) % % %
Saha and Prasad (2020) % % %
Gupta et al. (2020) % % %
Mitra et al. (2020) % % %
MHC Class II Peptide Vaccine Evaluation
OptiVax-Robust (Ours) % % %
S-protein % % %
Ramaiah and Arumugaswami (2020) % % %
S1-subunit % % %
Random subset of binders % % %
Fast et al. (2020) % % %
Banerjee et al. (2020) % % %
Tahir ul Qamar et al. (2020) % % %
Poran et al. (2020) % % %
Akhand et al. (2020) % % %
Singh et al. (2020) % % %
Ahmed et al. (2020) % % %
Mitra et al. (2020) % % %
Vashi et al. (2020) % % %
Abdelmageed et al. (2020) % % %
Baruah and Bose (2020) % % %
Table S1: Comparison of existing baselines, S-protein peptides, and OptiVax-Robust peptide vaccine designs on various population coverage evaluation metrics. The rows are sorted by EvalVax-Robust . Random subsets are generated 200 times. The binders used for generating random subsets are defined as peptides that are predicted to bind with 50 nM to more than 5 of the alleles.

Evaluation of OptiVax-Robust Vaccine Design and baselines with non-experimentally-calibrated machine learning predictions

We found that MarginalGreedy produced vaccine designs superior to the baselines we tested even when it used immunogenicity models that did not rely upon clinical data of peptide immunogenicity. Since our immunogenicity model used experimental data that was not accessible to the baseline methods, our designs have an advantage over baseline vaccines that did not use calibrated machine learning predictions. We repeated vaccine design using an ensemble of NetMHCpan-4.0 and MHCflurry with 50 nM predicted affinity cutoff for predicting MHC class I immunogenicity and NetMHCIIpan-4.0 for MHC class II. As shown in Figure S1, OptiVax-Robust again shows superior performance on all metrics at all vaccine size under 35, indicating the success of MarginalGreedy in optimizing population coverage with diverse peptide display.

Figure S1: EvalVax population coverage evaluation of SARS-CoV-2 vaccines for (A) MHC class I and (B) MHC class II using non-experimentally calibrated machine learning predictions. (a) EvalVax-Robust population coverage with peptide-HLA hits per individual, OptiVax-Robust performance is shown by the blue curve and baseline performance is shown by red crosses (labeled by name of first author). (b) EvalVax-Robust population coverage with peptide-HLA hits. (c) EvalVax-Robust population coverage with peptide-HLA hits. (d) Comparison of OptiVax-Robust and baselines on expected number of peptide-HLA hits.