Introduction
The COVID19 pandemic has caused an unprecedented challenge for healthcare systems worldwide, and developing a vaccine that produces durable immunity with high population coverage is of upmost importance. A peptide vaccine uses a set of peptides to elicit a protective cellular immune response (Ott et al., 2017; Li et al., 2014). For a peptide to be effective in a vaccine it must be presented by an individual’s Major Histocompatibility Complex (MHC) molecules and be immunogenic. A displayed peptide is immunogenic when it activates T cells, which expand in number and mount a response against pathogens or tumor cells. Memory T cells provide robust immunity against COVID19, even in the absence of detectable antibodies (Sekine et al., 2020). The use of peptide vaccine components to activate T cells is in development for cancer (Hu et al., 2018) and viral diseases including HIV (Arunachalam et al., 2020), HPV (Kenter et al., 2009) and malaria (Li et al., 2014; Rosendahl Huber et al., 2014).
A challenge for the design of peptide vaccines is the diversity of human MHC alleles that each have specific preferences for the peptide sequences they will display. The Human Leukocyte Antigen (HLA) loci, located within the MHC, encode the HLA class I and class II molecules. We consider the three classical class I loci (HLAA, HLAB, and HLAC) and three loci that encode class II molecules (HLADR, HLADQ, and HLADP). An individual’s HLA type describes the alleles they carry at each of these loci. Peptides of length 8–10 residues can bind to HLA class I molecules whereas those of length 13–25 bind to HLA class II molecules (Rist et al., 2013; Chicz et al., 1992). To create effective vaccines it is necessary to consider the HLA allelic frequency in the population, as well as linkage disequilibrium between HLA genes to discover a set of peptides that is likely to be robustly displayed.
For a peptide vaccine to be effective it needs to invoke a robust cellular immune response with a diverse set of peptides (Schultheiß et al., 2020; Grifoni et al., 2020a), which motivates our use of times coverage. Redundant coverage is also important to compensate for our imperfect ability to predict what peptides will be immunogenic in a given individual. We adopt an experimentally calibrated model of peptideHLA immunogenicity to create a set of candidates and perform population coverage optimization to select a compact set of peptides that provides diverse display of peptides in each individual. Our vaccine designs and evaluations are based upon the observed immunogenicity of peptides in convalescent COVID19 patient peripheral blood mononuclear cell samples (Snyder et al., 2020; Klinger et al., 2015; Nolan et al., 2020), and by ensembling machine learning predictions for peptides or HLA alleles that are not observed in the clinical data Liu et al. (2021).
We formalize the number of times that an individual displays peptides as a coverage problem, and thus we generalize single coverage problems to times coverage problems. We show that framing vaccine design as maximum times coverage produces a solution that produces superior predicted population coverage when compared to 29 previous published vaccines for COVID19 with less than 150 peptides. We also show that maximum times coverage is not submodular, and introduce MarginalGreedy, an efficient algorithm for solving times coverage problem on both synthetic data and real vaccine design.
In the maximum times coverage problem, we are provided a set of elements, a weight for each element, and a set of overlays where each overlay specifies an element specific coverage of zero or more times. The goal is to select up to overlays such that the sum of the weights of elements that are covered at least times is maximized. In this framing, an element is a specific collection of HLA alleles (a haplotype), weights are the frequency of haplotypes in the population, is the desired number of peptides displayed by each individual, and an overlay is a peptide that is predicted to be displayed by each haplotype a specified number of times. The solution of the maximum time coverage problem allows us to find a set of overlays (peptides) that maximizes the sum of weights (population coverage).
Our introduction of weighted elements and required times coverage creates a new class of problem without a known solution or complexity bound. The closest problem to the maximum times coverage problem is the multiset multicover problem which does not assign weights to the elements and thus can be formulated as the Covering Integer Program (CIP) problem (Srinivasan, 1999). A time approximation algorithm for CIP can violate coverage constraints Dobson (1982); Kolliopoulos (2003); Kolliopoulos and Young (2001). Other existing robust coverage problems do not address the need for explicit times coverage of elements to ensure each individual is predicted to display multiple peptides. Deletionrobust submodular maximization protects against adversarial deletion (Bogunovic et al., 2018; Mirzasoleiman et al., 2017) and robust submodular optimization protects against objective function uncertainty (Iyer, 2019), but neither guarantees that each element is covered times.
The Maximum times coverage problem
Set Cover and Maximum Coverage
In the standard Set Cover and Maximum Coverage problems, we are given a set of elements (also known as the universe) and a collection of subsets of such that . The goal in the Set Cover problem is to select a minimalcardinality set of subsets from such that their union covers . In the Maximum Coverage problem, our goal is to select subsets from such that the union of these subsets contains maximum number of elements. Both problems are known to be NPhard Feige (1998), but can be solved with a greedy approximation algorithm with the best guaranteed approximation ratio.
A generalization of Set Cover is to use any submodular set function as the objective instead of the cardinality function. Given a set function that assign each subset a value where is a finite set, we define the discrete derivative (or marginal gain) as for . A set function is submodular iff it satisfies the following diminishing return property:
Definition 1.
(Submodularity) A function is submodular if for every and it holds that .
Equivalently, is a submodular function if for every .
We can then define the Submodular Set Cover problem as the problem of selecting a minimalcardinality subset such that , where is a monotone submodular function and is the ground set of items. A similar problem is the Mincost Coverage problem:
The Submodular Maximum Coverage problem is defined as selecting a subset that maximizes with cardinality constraint , where is monotone and submodular.
Given is monotone submodular, a greedy algorithm is known to provide approximation for Submodular Maximum Coverage and approximation for Submodular Set Cover (Nemhauser et al., 1978).
Multiset Multicover problem
The Multiset Multicover (MSMC) problem is a generalization of Set Cover, where multisets are special type of sets in which an element can appear in for more than once. The objective of the MSMC problem is to determine the minimum number of multisets (a multiset can be chosen multiple times) such that each element is covered at least times. It can be formulated into Covering Integer Program (CIP) Srinivasan (1999) problem:
Definition 2.
(Covering Integer Program, CIP) Given , a CIP seeks to minimize , subject to ,and .
Here represents the number of times th element appears in the th multiset. The is set to be all 1 in MSMC. The constraints
are called multiplicity constraints which limit the number of times a single multiset can be reused, and they generally make covering problems much harder as natural linear programming (LP) relaxation can have an unbounded integrality gap
(Chuzhoy and Naor, 2006). Dobson (1982) provides a combinatorial greedy approximation algorithm ( stands for th harmonic number) but multiplicity constraints can be dealt with effectively only in the (0,1) case thus this algorithm can be as bad as polynomial. Kolliopoulos and Young (2001); Kolliopoulos (2003) gave a tighterbound solution that can obtain approximation.Maximum times Coverage
We introduce the maximum times coverage problem, a variant of the MSMC problem that accounts for multiple coverage of each element while also assigning weights to different elements. We are given a set with elements each associated with a nonnegative weight , and a set of overlays . Each overlay covers each element in an element specific number of times , which is similar to a multiset. When the element is not covered by overlay . We use a very strict multiplicity constraint such that each overlay can be used only once. Given a subset of overlays , the total number of times an element is covered by is the sum of for each overlay in :
(1) 
We define the times coverage function as the sum of weights of elements in that are covered at least times by . Figure 1 shows an example computation of the times coverage function.
(2) 
The objective of the Maximum times Coverage problem is to select overlays such that is maximized. This can be formulated as the maximization of the monotone set function under cardinality constraint ,
(3) 
We define Mincost times Coverage as the minimum set of overlays such that the sum of the weights of elements covered at least times is . We assume provides sufficient times coverage for . We define the times Set Cover problem as the special case of Mincost times Coverage where .
(4) 
(5) 
Theorem 1.
The times coverage function is not submodular.
Proof.
We show is not submodular for (a similar counter example can be found for any ). Consider the example overlays in Table 1. When , none of ,, or provides times coverage while covers two times and covers two times. Therefore, the marginal gain of adding , , or into an empty set is always zero, whereas adding into or achieves nonzero gains and adding into achieves even higher gain:
Given that and , the function does not satisfy the diminishing return property (Definition 1) and thus is not submodular. ∎
0.01  0.01  0.5  0.48  

0  1  0  1  
0  0  1  1  
1  1  0  0 
Since the times coverage problem is not submodular, we cannot take advantage of the proven nearoptimal performance of the greedy algorithm for the Submodular Maximum Coverage problem. Thus we seek new solutions.
Vaccine population coverage maximization
Peptide vaccine design can be framed as an instance of the maximum time coverage problem. Peptides must immunogenic to be effective in a vaccine, and we use an integrated model of peptideHLA immunogenicity (Supplemental Data). We define a peptideHLA hit as a (peptide, HLA allele) pair where the peptide is predicted to immunogenic and displayed the HLA allele. Once we have determined a candidate set of peptides that are predicted to be immunogenic we then need to select a minimal subset of peptides such that each individual in of the population is predicted to have peptideHLA hits.
We first introduce EvalVaxRobust
, an evaluation tool for estimating the population coverage of a proposed peptide vaccine set.
EvalVaxRobust evaluates the percentage of the population having at least peptideHLA binding hits in each individual. It takes into consideration allelic zygosity, and utilizes HLA haplotype frequencies for MHC class I (HLAA/B/C) and MHC class II (HLADP/DQ/DR) genes to account for linkage disequilibrium (LD) among loci.The MHC class I genes are encoded by the HLAA, HLAB, and HLAC loci and a single chromosome contains one allele at each locus that is described by , , and respectively. The haplotype frequency of is denoted by and . Each individual inherits two haplotypes to form their diploid genotype. The the frequency of a diploid genotype is thus:
(6) 
For a given peptide and an HLA allele , our machine learning model outputs a binary hit prediction indicating peptideHLA immunogenicity. For each diploid genotype we compute the total number of peptideHLA hits as the sum of over the unique HLA alleles in the genotype. There can be 36 unique alleles depending on the zygosity of each locus:
(7) 
The total number of peptideHLA hits provided by a set of peptides is the sum of number of hits per peptide:
(8) 
We define the EvalVaxRobust predicted population coverage with peptideHLA hits for a peptide vaccine set as:
(9) 
EvalVaxRobust optimization can be accomplished by a solution to maximum times coverage. corresponds to the possible genotypes, and the weights are population genotype frequencies . The peptide candidate set is the set of possible overlays , where each peptide is an overlay which covers a genotype times.
The MarginalGreedy algorithm for maximum times coverage
We observed that greedy solutions to the maximum times coverage problem are problematic when the times objective is directly approached. The greedy approach can fail if it is unable to find early overlays with sufficient times coverage, and can fail later because of early bad overlay choices. Both problems result in subsequent overlay choices not providing sufficient marginal gain to avoid ties and the random selection of subsequent overlays. Thus we needed a solution that did not try to directly solve for times coverage at the outset, but rather marginally approached it.
We propose MarginalGreedy, an efficient algorithm that optimizes times coverage with a sequence of greedy optimization cycles where the th cycle optimizes the coverage function . We establish coverage starting at and increase the criteria to . Thus early overlay selections are guided by less stringent cyclespecific coverage objectives. A set of coverage cutoffs is used as the termination condition for each greedy optimization cycle, and when not specified, we assume by default. We use beam search to keep track of top candidate solutions at each iteration.
The full algorithm is given in Algorithm 1. A similar algorithm can be used to solve the times Set Cover problem in which . For the Maximum times Coverage problem with cardinality constraint , the optimization terminates when . When beam search is not used to reduce computation time (), we have extended the algorithm to break ties during the times coverage iteration by looking ahead to times coverage. We call this extension look ahead tiebreaking.
Another advantage of MarginalGreedy is that it is capable of guaranteeing high coverage for by controlling the cyclespecific coverage cutoffs . This is desired in vaccine design where wider population coverage is also important and we want to make sure that almost 100% of the population will be covered at least once.
Results
MarginalGreedy outperforms a greedy baseline on toy examples
We empirically evaluate our MarginalGreedy algorithm with two toy examples: a LARGE dataset where a set of 30 overlays are randomly generated to cover a set of 10 elements (with equal weights) 0, 1, or 2 times, and a SMALL dataset where a set of 10 overlays that were randomly generated to cover a set of 5 elements with equal weights 0, 1, or 2 times. Figure 2 shows the efficiency of the MarginalGreedy algorithm on both the times Set Cover and Maximum times Coverage problems for varying values of . We compare our MarginalGreedy algorithm to a greedy algorithm that directly optimizes times coverage and find the greedy algorithm degenerates to suboptimal solutions for large values of . These suboptimal solutions can be as bad as random. We repeated problem generation and optimization with 6 different random seeds to provide confidence bounds and used beam size with lookahead tiebreaking. As shown in Figure 2, MarginalGreedy outperforms greedy on all tasks and datasets. We observed that MarginalGreedy has a significant advantage over greedy in tests with larger and in regions of higher coverage. We also compared the performance of polynomialtime MarginalGreedy to an exponentialtime exhaustive search procedure that discovers the optimal set of overlays on smaller problems. We found MarginalGreedy produces solutions close to optimal.
COVID19 vaccine design using maximum times coverage
We used our MarginalGreedy algorithm to design a peptide vaccine for COVID19 using the EvalVaxRobust objective described in (9). We obtained the SARSCoV2 viral proteome from the first documented case from GISAID (Elbe and BucklandMerrett, 2017) (sequence entry Wuhan/IPBCAMSWH01/2019). We used Nextstrain (Hadfield et al., 2018) to identify open reading frames (ORFs) and translate the sequence. We applied sliding windows of length 8–10 (MHC class I) and 13–25 (MHC class II) to identify candidate peptides for inclusion in our peptide vaccine, resulting in 29,403 candidate peptides for MHC class I and 125,593 candidate peptides for MHC class II. We exclude peptides that may be glycosylated Gupta et al. (2004); Wolfert and Boons (2013), cross known cleavage sites Wang et al. (2020); Consortium (2019), overlap with human selfpeptides Consortium (2019), or have a mutation rate (as calculated over 4,690 geographically sampled viral genomes) Elbe and BucklandMerrett (2017); Hadfield et al. (2018) (details in Supplement). We use observed haplotype frequencies of 230 different HLAA, HLAB, and HLAC loci for class I computations and observed haplotype frequencies of 280 different HLADP, HLADQ, and HLADR for class II computations (Liu et al., 2020). This dataset contains independent haplotype frequency measurements for three populations selfreporting as having White, Black, or Asian ancestry. We score peptideHLA immunogenicity based upon clinical data from convalescent COVID19 patients Liu et al. (2021). For peptideHLA combinations not observed in our clinical data we selected a machine learning model of immunogenicity that best predicted the peptideHLA combinations we did observe. Our selected MHC class I model predicts a peptideHLA combination will be immunogenic if they are predicted to bind with an affinity of 50nM by the mean of the predictions from PUFFIN (Zeng and Gifford, 2019), NetMHCpan4.0 (Jurtz et al., 2017) and MHCflurry 2.0 (O’Donnell et al., 2020). Our selected MHC class II model predicts a peptideHLA combination will be immunogenic if they are predicted to bind with an affinity of 50 nM by NetMHCIIpan4.0.
We implemented the OptiVaxRobust method for peptide vaccine design using the MarginalGreedy algorithm, but modified MarginalGreedy to reduce vaccine redundancy by eliminating unselected peptides that are within three (MHC class I) or five (MHC class II) edits on a sequence distance metric from the selected peptides at each iteration. We used a beam size of 10 for MHC class I and 5 for MHC class II.
Contemporary peptide vaccine design methods use machine learning scoring of peptide display for specific HLA alleles followed by the manual selection of peptide sets or a greedy search that maximizes coverage at =1. These methods do not utilize the distribution of HLA haplotypes in a population and thus can not accurately assess the population coverage provided by a vaccine. For a selected set of peptides, the IEDB Population Coverage Tool Bui et al. (2006) estimates peptideMHC binding coverage and the distribution of peptides displayed for a given population but assumes independence between different loci and thus does not consider linkage disequilibrium. However, the IEDB tool does not provide automated peptide selection as we describe.
We compared our vaccine designs with 29 vaccine designs in the literature on the probability that a vaccine has at least
peptideHLA hits per individual in a population, and the expected number of per individual peptideHLA hits in the population, which provides insight on how well a vaccine is displayed on average (Lee and Koohy, 2020; Fast et al., 2020; Poran et al., 2020; Bhattacharya et al., 2020; Baruah and Bose, 2020; Abdelmageed et al., 2020; Ahmed et al., 2020; Srivastava et al., 2020; Herst et al., 2020; Vashi et al., 2020; Akhand et al., 2020; Mitra et al., 2020; Khan et al., 2020; Banerjee et al., 2020; Ramaiah and Arumugaswami, 2020; Gupta et al., 2020; Saha and Prasad, 2020; Tahir ul Qamar et al., 2020; Singh et al., 2020). Figure 3shows the comparison between OptiVaxRobust (A) MHC class I and (B) class II vaccine designs at all vaccine sizes (top solution in the beam up to the given vaccine size) from 145 peptides (blue curves) and baseline vaccines (red crosses) from prior work. We observe superior performance of OptiVaxRobust vaccine designs on all evaluation metrics at all vaccine sizes for both MHC class I and class II. Most baselines achieve reasonable coverage at
peptide hits. However, many fail to show a high probability of higher hit counts, indicating a lack of predicted redundancy if a single peptide is not displayed. Note that OptiVaxRobust also outperforms all baselines on coverage and achieve 99.99% (MHC class I) and 99.67% (MHC class II) coverage for , suggesting its capability to cover almost full population at least once while optimizing for higher peptide display diversity. We also found that OptiVaxRobust outperforms all baselines with an immunogenicity model that does not incorporate clinical data (Supplement Figure S1).MHC class I results.
We selected an optimized set of peptides from all SARSCoV2 proteins using OptiVax. We use our MHC class I integrated model of peptideHLA immunogenicity for our objective function (Supplemental). After all of our filtering steps we considered 1546 candidate peptides. With OptiVaxRobust optimization, we design a vaccine with 18 peptides that achieves 99.999% EvalVaxRobust coverage over populations selfreporting as having Asian, Black, or White ancestry with at least one peptideHLA hit per individual. This set of peptides also provides 99.53% coverage with at least 5 peptideHLA hits and 92.32% coverage with at least 8 peptideHLA hits (Figure 4A, Table S1). The population level distribution of the number of peptideHLA hits in White, Black, and Asian populations is shown in Figure 4A, where the expected number of peptideHLA hits 12.532, 11.425 and 13.326, respectively.
MHC class II results.
We use our MHC class II model of peptideHLA immunogenicity for our objective function (Supplemental). After all of our filtering steps we had 8003 candidate peptides. With OptiVaxRobust optimization, we design a vaccine with 18 peptides that achieves 99.67% EvalVaxRobust coverage with at least one peptideHLA hit per individual. This set of peptides also provides 97.22% coverage with at least 5 peptideHLA hits and 87.99% coverage with at least 8 peptideHLA hits (Figure 4B, Table S1). The population level distribution of the number of peptideHLA hits per individual in populations selfreporting as having Asian, Black, or White ancestry is shown in Figure 4B, where the expected number of peptideHLA hits is 17.206, 16.085 and 11.210, respectively.
Conclusion
We introduced the maximum times coverage problem, and showed that it is not submodular. We presented a beam search algorithm for solving the problem, and used it to produce a peptide vaccine design for COVID19. We compared the resulting optimized peptide vaccine design with 29 other published designs and found that the optimized design provides substantially better population coverage for both MHC class I and class II presentation of viral peptides. The use of times coverage as an objective increases vaccine redundancy and thus the probability that one of the presented peptides will be immunogenic and produce a T cell response. Our methods can also be used to augment existing vaccine designs with peptides to improve their predicted population coverage by initializing the algorithm with peptides that are present in an existing design.
Ethics Statement
This work presents a principled combinatorial design approach for developing peptide vaccines using machine learning methods, and we apply it to develop a novel MHC class I and II vaccine design for COVID19. On our metrics, the resulting design is predicted to provide better population coverage than 29 existing vaccine designs with less than 150 peptides. More broadly, the EvalVax and OptiVax methods we introduce may be used to evaluate and design peptide vaccines against viral infections or cancer to achieve greater population coverage and robust peptide display per individual over existing methods. The general formulation of the maximum times coverage problem may find use in other fields, such as economics. One potential consequence of this work is identifying vaccines that do not provide adequate population coverage that could lead to costly failures of vaccine trials. We aim to mitigate this risk via our principled vaccine formulation objective and conservative approach for defining the candidate peptide set.
Acknowledgements
This work was supported in part by Schmidt Futures, a Google Cloud Platform grant, NIH grant R01CA218094, and a C3.AI DTI Award to D.K.G. Ge Liu’s contribution was made prior to joining Amazon.
Code Availability
An open source implementation of the methods described can be found at
https://github.com/giffordlab/optivax.Author Contributions
G.L., B.C., and D.G. contributed to problem definition and solution. G.L. designed and implemented the optimization procedure, with advice from B.C. and D.G. G.L., B.C., and D.G. wrote the paper.
Declaration of Interests
Brandon Carter is an employee of Think Therapeutics, and Ge Liu is an employee of Amazon. David Gifford and Brandon Carter have a financial interest in Think Therapeutics, Inc.
References
 Design of multi epitopebased peptide vaccine against E protein of human 2019nCoV: an immunoinformatics approach. bioRxiv. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 Preliminary identification of potential vaccine targets for the COVID19 coronavirus (SARSCoV2) based on SARSCoV immunological studies. Viruses 12 (3), pp. 254. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 Genome based evolutionary study of SARSCoV2 towards the prediction of epitope based chimeric vaccine. bioRxiv. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 T cellinducing vaccine durably prevents mucosal SHIV infection even with lower neutralizing antibody titers. Nature Medicine. Cited by: Introduction.
 Energetics based epitope screening in SARS CoV2 (COVID 19) spike glycoprotein by immunoinformatic analysis aiming to a suitable vaccine development. bioRxiv. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 Immunoinformaticsaided identification of T cell and B cell epitopes in the surface glycoprotein of 2019nCoV. Journal of Medical Virology. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 Development of epitopebased peptide vaccine against novel coronavirus 2019 (SARSCOV2): immunoinformatics approach. Journal of medical virology 92 (6), pp. 618–631. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 Robust maximization of nonsubmodular objectives. In AISTATS, Cited by: Introduction.
 Predicting population coverage of Tcell epitopebased diagnostics and vaccines. BMC Bioinformatics 7 (1), pp. 153. Cited by: COVID19 vaccine design using maximum times coverage.
 Predominant naturally processed peptides bound to HLADR1 are derived from MHCrelated molecules and are heterogeneous in size. Nature 358 (6389), pp. 764–768. Cited by: Introduction.
 Covering problems with hard capacities. SIAM Journal on Computing 36 (2), pp. 498–515. Cited by: Multiset Multicover problem.
 UniProt: a worldwide hub of protein knowledge. Nucleic acids research 47 (D1), pp. D506–D515. Cited by: COVID19 vaccine design using maximum times coverage, Removal of cleavage regions., Selfepitope removal..
 The spike glycoprotein of the new coronavirus 2019nCoV contains a furinlike cleavage site absent in CoV of the same clade. Antiviral research 176, pp. 104742. Cited by: Removal of cleavage regions..

Worstcase analysis of greedy heuristics for integer programming with nonnegative data
. Mathematics of Operations Research 7 (4), pp. 515–531. Cited by: Introduction, Multiset Multicover problem.  Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Challenges 1 (1), pp. 33–46. Cited by: COVID19 vaccine design using maximum times coverage, Removal of mutable peptides..
 Potential tcell and bcell epitopes of 2019ncov. bioRxiv. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 A threshold of ln n for approximating set cover. Journal of the ACM (JACM) 45 (4), pp. 634–652. Cited by: Set Cover and Maximum Coverage.
 Targets of T cell responses to SARSCoV2 coronavirus in humans with COVID19 disease and unexposed individuals. Cell. External Links: Document Cited by: Introduction, Detailed OptiVaxRobust Vaccine Design Results.
 A sequence homology and bioinformatic approach can predict candidate targets for immune responses to SARSCoV2. Cell Host & Microbe 27 (4), pp. 671–680. Cited by: Detailed OptiVaxRobust Vaccine Design Results.
 Identification of potential vaccine candidates against SARSCoV2, a step forward to fight novel coronavirus 2019nCoV: a reverse vaccinology approach. bioRxiv. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 Prediction of Nglycosylation sites in human proteins. In preparation. External Links: Link Cited by: COVID19 vaccine design using maximum times coverage, Removal of glycosylated peptides..
 Nextstrain: realtime tracking of pathogen evolution. Bioinformatics 34 (23), pp. 4121–4123. Cited by: COVID19 vaccine design using maximum times coverage, Removal of mutable peptides..
 An effective CTL peptide vaccine for ebola zaire based on survivors’ CD8+ targeting of a particular nucleocapsid protein epitope with potential implications for COVID19 vaccine design. Vaccine. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 Towards personalized, tumourspecific, therapeutic vaccines for cancer. Nature Reviews Immunology 18 (3), pp. 168. Cited by: Introduction.
 A unified framework of robust submodular optimization. arXiv preprint arXiv:1906.06393. Cited by: Introduction.
 NetMHCpan4.0: improved peptide–MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data. The Journal of Immunology 199 (9), pp. 3360–3368. Cited by: COVID19 vaccine design using maximum times coverage.
 Vaccination against HPV16 oncoproteins for vulvar intraepithelial neoplasia. New England Journal of Medicine 361 (19), pp. 1838–1847. Cited by: Introduction.
 Design of an epitopebased peptide vaccine against the Severe Acute Respiratory Syndrome Coronavirus2 (SARSCoV2): a vaccine informatics approach. bioRxiv. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 Multiplex identification of antigenspecific T cell receptors using a combination of immune assays and immune receptor sequencing. PLoS One 10 (10), pp. e0141561. Cited by: Introduction.
 Tight approximation results for general covering integer programs. In Proceedings 42nd IEEE Symposium on Foundations of Computer Science, pp. 522–528. Cited by: Introduction, Multiset Multicover problem.
 Approximating covering integer programs with multiplicity constraints. Discrete applied mathematics 129 (23), pp. 461–473. Cited by: Introduction, Multiset Multicover problem.
 In silico identification of vaccine targets for 2019nCoV. F1000Research 9. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 Peptide vaccine: progress and challenges. Vaccines 2 (3), pp. 515–536. Cited by: Introduction.
 Computationally optimized SARSCoV2 MHC class I and II vaccine formulations predicted to target human haplotype distributions. Cell Systems 11 (2), pp. 131–144. Cited by: COVID19 vaccine design using maximum times coverage.
 Predicted cellular immunity population coverage gaps for SARSCoV2 subunit vaccines and their augmentation by compact peptide sets. Cell Systems 12 (1), pp. 102–107. Cited by: Introduction, COVID19 vaccine design using maximum times coverage.
 Deletionrobust submodular maximization: data summarization with the right to be forgotten. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 2449–2458. Cited by: Introduction.
 Multiepitope based peptide vaccine design against SARSCoV2 using its spike protein. bioRxiv. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 An analysis of approximations for maximizing submodular set functionsI. Mathematical programming 14 (1), pp. 265–294. Cited by: Set Cover and Maximum Coverage.
 A largescale database of Tcell receptor beta (TCR) sequences and binding associations from natural and synthetic exposure to SARSCoV2. Research Square. Cited by: Introduction.
 MHCflurry 2.0: improved panallele prediction of MHC class Ipresented peptides by incorporating antigen processing. Cell Systems 11 (1), pp. 42–48. Cited by: COVID19 vaccine design using maximum times coverage.
 An immunogenic personal neoantigen vaccine for patients with melanoma. Nature 547 (7662), pp. 217–221. Cited by: Introduction.
 Sequencebased prediction of vaccine targets for inducing T cell responses to SARSCoV2 utilizing the bioinformatics predictor RECON. bioRxiv. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 Insights into crossspecies evolution of novel human coronavirus 2019nCoV and defining immune determinants for vaccine development. bioRxiv. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 HLA peptide length preferences control CD8+ T cell responses. The Journal of Immunology 191 (2), pp. 561–571. Cited by: Introduction.
 T cell responses to viral infections–opportunities for peptide vaccination. Frontiers in immunology 5, pp. 171. Cited by: Introduction.
 In silico approach for designing of a multiepitope based vaccine against novel Coronavirus (SARSCOV2). bioRxiv. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 Nextgeneration sequencing of T and B cell receptor repertoires from COVID19 patients showed signatures associated with severity of disease. Immunity 53 (2), pp. 442–455. Cited by: Introduction.
 Robust T cell immunity in convalescent individuals with asymptomatic or mild COVID19. Cell 183, pp. 158–168. Cited by: Introduction.
 Designing a multiepitope peptidebased vaccine against SARSCoV2. bioRxiv. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 Magnitude and dynamics of the TCell response to SARSCoV2 infection at both individual and population levels. medRxiv. Cited by: Introduction.
 Improved approximation guarantees for packing and covering integer programs. SIAM Journal on Computing 29 (2), pp. 648–670. Cited by: Introduction, Multiset Multicover problem.
 Structural basis to design multiepitope vaccines against Novel Coronavirus 19 (COVID19) infection, the ongoing pandemic emergency: an in silico approach. bioRxiv. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 Designing of a next generation multiepitope based vaccine (MEV) against SARSCOV2: immunoinformatics and in silico approaches. bioRxiv. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 Understanding the B and T cells epitopes of spike protein of severe respiratory syndrome coronavirus2: a computational way to predict the immunogens. bioRxiv. Cited by: COVID19 vaccine design using maximum times coverage, Table S1, Detailed OptiVaxRobust Vaccine Design Results.
 A unique protease cleavage site predicted in the spike protein of the novel pneumonia coronavirus (2019ncov) potentially related to viral transmissibility. Virologica Sinica, pp. 1–3. Cited by: COVID19 vaccine design using maximum times coverage, Removal of cleavage regions..
 Adaptive immune activation: glycosylation does matter. Nature Chemical Biology 9 (12), pp. 776–784. External Links: Document Cited by: COVID19 vaccine design using maximum times coverage, Removal of glycosylated peptides..
 Quantification of uncertainty in peptideMHC binding prediction improves highaffinity peptide selection for therapeutic design. Cell Systems 9 (2), pp. 159–166. Cited by: COVID19 vaccine design using maximum times coverage.
Details of Candidate Peptide Filtering
For our SARSCoV2 peptide vaccine design, we eliminate peptides that are expected to mutate and thus cause vaccine escape, peptides crossing cleavage sites, peptides that may be glycosylated, and peptides that are identical to peptides in the human proteome.
Removal of mutable peptides.
We eliminate peptides that are observed to mutate above an input threshold rate (0.001) to improve coverage over all SARSCoV2 variants and reduce the chance that the virus will mutate and escape vaccineinduced immunity in the future. When possible, we select peptides that are observed to be perfectly conserved across all observed SARSCoV2 viral genomes. Peptides that are observed to be perfectly conserved in thousands of examples may be functionally constrained to evolve slowly or not at all.
For SARSCoV2, we obtained the most up to date version of the GISAID database (Elbe and BucklandMerrett, 2017) (as of 2:02pm EST May 13, 2020, acknowledgements in Supplementary table) and used Nextstrain (Hadfield et al., 2018)^{1}^{1}1from GitHub commit 639c63f25e0bf30c900f8d3d937de4063d96f791 to remove genomes with sequencing errors, translate the genome into proteins, and perform multiple sequence alignments (MSAs). We retrieved 24468 sequences from GISAID, and 19288 remained after Nextstrain quality processing. After quality processing, Nextstrain randomly sampled 34 genomes from every geographic region and month to produce a representative set of 5142 genomes for evolutionary analysis. Nextstrain definition of a “region” can vary from a city (e.g., “Shanghai”) to a larger geographical district. Spatial and temporal sampling in Nextstrain is designed to provide a representative sampling of sequences around the world.
The 5142 genomes sampled by Nextstrain were then translated into protein sequences and aligned. We eliminated viral genome sequences that had a stop codon, a gap, an unknown amino acid (because of an uncalled nucleotide in the codon), or had a gene that lacked a starting methionine, except for ORF1b which does not begin with a methionine. This left a total of 4690 sequences that were used to compute peptide level mutation probabilities. For each peptide, the probability of mutation was computed as the number of nonreference peptide sequences observed divided by the total number of peptide sequences observed.
Removal of cleavage regions.
SARSCoV2 contains a number of posttranslation cleavage sites in ORF1a and ORF1b that result in a number of nonstructural protein products. Cleavage sites for ORF1a and ORF1b were obtained from UniProt (Consortium, 2019) under entry P0DTD1. In addition, a furinlike cleavage site has been identified in the Spike protein (Wang et al., 2020; Coutard et al., 2020). This cleavage occurs before peptides are loaded in the endoplasmic reticulum for class I or endosomes for class II. Any peptide that spans any of these cleavage sites is removed from consideration. This removes 3,739 peptides out of the 154,996 we consider across windows 810 (class I) and 1325 (class II) (2.4%).
Removal of glycosylated peptides.
Glycosylation is a posttranslational modification that involves the covalent attachment of carbohydrates to specific motifs on the surface of the protein. We eliminate all peptides that are predicted to have Nlinked glycosylation as it can inhibit MHC class I peptide loading and T cell recognition of peptides (Wolfert and Boons, 2013). We identified peptides that may be glycosylated with the NetNGlyc Nglycosylation prediction server (Gupta et al., 2004) and eliminated peptides where NetNGlyc predicted a nonzero Nglycosylation probability in any residue. This resulted in the elimination of 18,957 of the 154,996 peptides considered (12%).
Selfepitope removal.
T cells are selected to ignore peptides derived from the normal human proteome, and thus we remove any self peptides from consideration for a vaccine. In addition, it is possible that a vaccine might stimulate the adaptive immune system to react to a self peptide that was presented at an abnormally high level, which could lead to an autoimmune disorder. All peptides from SARSCoV2 were scanned against the entire human proteome downloaded from UniProt (Consortium, 2019) under Proteome ID UP000005640. A total of 48 exact peptide matches (46 8mers, two 9mers) were discovered and eliminated from consideration.
Detailed OptiVaxRobust Vaccine Design Results
Table S1 shows the evaluation of our OptiVaxRobust vaccine designs using the MarginalGreedy algorithm compared to 29 designs by others as baselines (Lee and Koohy, 2020; Fast et al., 2020; Poran et al., 2020; Bhattacharya et al., 2020; Baruah and Bose, 2020; Abdelmageed et al., 2020; Ahmed et al., 2020; Srivastava et al., 2020; Herst et al., 2020; Vashi et al., 2020; Akhand et al., 2020; Mitra et al., 2020; Khan et al., 2020; Banerjee et al., 2020; Ramaiah and Arumugaswami, 2020; Gupta et al., 2020; Saha and Prasad, 2020; Tahir ul Qamar et al., 2020; Singh et al., 2020). Additionally, we compare to baseline vaccines containing the full S and S1 proteins (Grifoni et al., 2020b), OptiVaxRobust designed vaccines with peptides stemming from the S, M, and N proteins only (Grifoni et al., 2020a), and random subsets of peptides that are binders (averaged over 200 runs, each consisting of the same number of peptides as the OptiVaxRobust design).
Peptide Set  Vaccine Size  EvalVaxRobust  EvalVaxRobust  EvalVaxRobust  Exp. # peptideHLA hits / vaccine size  Exp. # peptideHLA hits (White)  Exp. # peptideHLA hits (Black)  Exp. # peptideHLA hits (Asian) 

MHC Class I Peptide Vaccine Evaluation  
OptiVaxRobust (Ours)  %  %  %  
Sprotein  %  %  %  
S1subunit  %  %  %  
Srivastava et al. (2020)  %  %  %  
Herst et al. (2020)  %  %  %  
Random subset of binders  %  %  %  
Herst et al. (2020)top16  %  %  %  
Baruah and Bose (2020)  %  %  %  
Fast et al. (2020)  %  %  %  
Vashi et al. (2020)  %  %  %  
Abdelmageed et al. (2020)  %  %  %  
Lee and Koohy (2020)  %  %  %  
Ahmed et al. (2020)  %  %  %  
Poran et al. (2020)  %  %  %  
Singh et al. (2020)  %  %  %  
Akhand et al. (2020)  %  %  %  
Bhattacharya et al. (2020)  %  %  %  
Khan et al. (2020)  %  %  %  
Saha and Prasad (2020)  %  %  %  
Gupta et al. (2020)  %  %  %  
Mitra et al. (2020)  %  %  %  
MHC Class II Peptide Vaccine Evaluation  
OptiVaxRobust (Ours)  %  %  %  
Sprotein  %  %  %  
Ramaiah and Arumugaswami (2020)  %  %  %  
S1subunit  %  %  %  
Random subset of binders  %  %  %  
Fast et al. (2020)  %  %  %  
Banerjee et al. (2020)  %  %  %  
Tahir ul Qamar et al. (2020)  %  %  %  
Poran et al. (2020)  %  %  %  
Akhand et al. (2020)  %  %  %  
Singh et al. (2020)  %  %  %  
Ahmed et al. (2020)  %  %  %  
Mitra et al. (2020)  %  %  %  
Vashi et al. (2020)  %  %  %  
Abdelmageed et al. (2020)  %  %  %  
Baruah and Bose (2020)  %  %  % 
Evaluation of OptiVaxRobust Vaccine Design and baselines with nonexperimentallycalibrated machine learning predictions
We found that MarginalGreedy produced vaccine designs superior to the baselines we tested even when it used immunogenicity models that did not rely upon clinical data of peptide immunogenicity. Since our immunogenicity model used experimental data that was not accessible to the baseline methods, our designs have an advantage over baseline vaccines that did not use calibrated machine learning predictions. We repeated vaccine design using an ensemble of NetMHCpan4.0 and MHCflurry with 50 nM predicted affinity cutoff for predicting MHC class I immunogenicity and NetMHCIIpan4.0 for MHC class II. As shown in Figure S1, OptiVaxRobust again shows superior performance on all metrics at all vaccine size under 35, indicating the success of MarginalGreedy in optimizing population coverage with diverse peptide display.
Comments
There are no comments yet.