The COVID-19 pandemic has caused an unprecedented challenge for healthcare systems worldwide, and developing a vaccine that produces durable immunity with high population coverage is of upmost importance. A peptide vaccine uses a set of peptides to elicit a protective cellular immune response (Ott et al., 2017; Li et al., 2014). For a peptide to be effective in a vaccine it must be presented by an individual’s Major Histocompatibility Complex (MHC) molecules and be immunogenic. A displayed peptide is immunogenic when it activates T cells, which expand in number and mount a response against pathogens or tumor cells. Memory T cells provide robust immunity against COVID-19, even in the absence of detectable antibodies (Sekine et al., 2020). The use of peptide vaccine components to activate T cells is in development for cancer (Hu et al., 2018) and viral diseases including HIV (Arunachalam et al., 2020), HPV (Kenter et al., 2009) and malaria (Li et al., 2014; Rosendahl Huber et al., 2014).
A challenge for the design of peptide vaccines is the diversity of human MHC alleles that each have specific preferences for the peptide sequences they will display. The Human Leukocyte Antigen (HLA) loci, located within the MHC, encode the HLA class I and class II molecules. We consider the three classical class I loci (HLA-A, HLA-B, and HLA-C) and three loci that encode class II molecules (HLA-DR, HLA-DQ, and HLA-DP). An individual’s HLA type describes the alleles they carry at each of these loci. Peptides of length 8–10 residues can bind to HLA class I molecules whereas those of length 13–25 bind to HLA class II molecules (Rist et al., 2013; Chicz et al., 1992). To create effective vaccines it is necessary to consider the HLA allelic frequency in the population, as well as linkage disequilibrium between HLA genes to discover a set of peptides that is likely to be robustly displayed.
For a peptide vaccine to be effective it needs to invoke a robust cellular immune response with a diverse set of peptides (Schultheiß et al., 2020; Grifoni et al., 2020a), which motivates our use of -times coverage. Redundant coverage is also important to compensate for our imperfect ability to predict what peptides will be immunogenic in a given individual. We adopt an experimentally calibrated model of peptide-HLA immunogenicity to create a set of candidates and perform population coverage optimization to select a compact set of peptides that provides diverse display of peptides in each individual. Our vaccine designs and evaluations are based upon the observed immunogenicity of peptides in convalescent COVID-19 patient peripheral blood mononuclear cell samples (Snyder et al., 2020; Klinger et al., 2015; Nolan et al., 2020), and by ensembling machine learning predictions for peptides or HLA alleles that are not observed in the clinical data Liu et al. (2021).
We formalize the number of times that an individual displays peptides as a coverage problem, and thus we generalize single coverage problems to -times coverage problems. We show that framing vaccine design as maximum -times coverage produces a solution that produces superior predicted population coverage when compared to 29 previous published vaccines for COVID-19 with less than 150 peptides. We also show that maximum -times coverage is not submodular, and introduce MarginalGreedy, an efficient algorithm for solving -times coverage problem on both synthetic data and real vaccine design.
In the maximum -times coverage problem, we are provided a set of elements, a weight for each element, and a set of overlays where each overlay specifies an element specific coverage of zero or more times. The goal is to select up to overlays such that the sum of the weights of elements that are covered at least times is maximized. In this framing, an element is a specific collection of HLA alleles (a haplotype), weights are the frequency of haplotypes in the population, is the desired number of peptides displayed by each individual, and an overlay is a peptide that is predicted to be displayed by each haplotype a specified number of times. The solution of the maximum -time coverage problem allows us to find a set of overlays (peptides) that maximizes the sum of weights (population coverage).
Our introduction of weighted elements and required -times coverage creates a new class of problem without a known solution or complexity bound. The closest problem to the maximum -times coverage problem is the multi-set multi-cover problem which does not assign weights to the elements and thus can be formulated as the Covering Integer Program (CIP) problem (Srinivasan, 1999). A -time approximation algorithm for CIP can violate coverage constraints Dobson (1982); Kolliopoulos (2003); Kolliopoulos and Young (2001). Other existing robust coverage problems do not address the need for explicit -times coverage of elements to ensure each individual is predicted to display multiple peptides. Deletion-robust submodular maximization protects against adversarial deletion (Bogunovic et al., 2018; Mirzasoleiman et al., 2017) and robust submodular optimization protects against objective function uncertainty (Iyer, 2019), but neither guarantees that each element is covered times.
The Maximum -times coverage problem
Set Cover and Maximum Coverage
In the standard Set Cover and Maximum Coverage problems, we are given a set of elements (also known as the universe) and a collection of subsets of such that . The goal in the Set Cover problem is to select a minimal-cardinality set of subsets from such that their union covers . In the Maximum Coverage problem, our goal is to select subsets from such that the union of these subsets contains maximum number of elements. Both problems are known to be NP-hard Feige (1998), but can be solved with a greedy approximation algorithm with the best guaranteed approximation ratio.
A generalization of Set Cover is to use any submodular set function as the objective instead of the cardinality function. Given a set function that assign each subset a value where is a finite set, we define the discrete derivative (or marginal gain) as for . A set function is submodular iff it satisfies the following diminishing return property:
(Submodularity) A function is submodular if for every and it holds that .
Equivalently, is a submodular function if for every .
We can then define the Submodular Set Cover problem as the problem of selecting a minimal-cardinality subset such that , where is a monotone submodular function and is the ground set of items. A similar problem is the Min-cost Coverage problem:
The Submodular Maximum Coverage problem is defined as selecting a subset that maximizes with cardinality constraint , where is monotone and submodular.
Given is monotone submodular, a greedy algorithm is known to provide approximation for Submodular Maximum Coverage and approximation for Submodular Set Cover (Nemhauser et al., 1978).
Multi-set Multi-cover problem
The Multi-set Multi-cover (MSMC) problem is a generalization of Set Cover, where multi-sets are special type of sets in which an element can appear in for more than once. The objective of the MSMC problem is to determine the minimum number of multisets (a multi-set can be chosen multiple times) such that each element is covered at least times. It can be formulated into Covering Integer Program (CIP) Srinivasan (1999) problem:
(Covering Integer Program, CIP) Given , a CIP seeks to minimize , subject to ,and .
Here represents the number of times -th element appears in the -th multi-set. The is set to be all 1 in MSMC. The constraints
are called multiplicity constraints which limit the number of times a single multi-set can be reused, and they generally make covering problems much harder as natural linear programming (LP) relaxation can have an unbounded integrality gap(Chuzhoy and Naor, 2006). Dobson (1982) provides a combinatorial greedy -approximation algorithm ( stands for -th harmonic number) but multiplicity constraints can be dealt with effectively only in the (0,1) case thus this algorithm can be as bad as polynomial. Kolliopoulos and Young (2001); Kolliopoulos (2003) gave a tighter-bound solution that can obtain -approximation.
Maximum -times Coverage
We introduce the maximum -times coverage problem, a variant of the MSMC problem that accounts for multiple coverage of each element while also assigning weights to different elements. We are given a set with elements each associated with a non-negative weight , and a set of overlays . Each overlay covers each element in an element specific number of times , which is similar to a multi-set. When the element is not covered by overlay . We use a very strict multiplicity constraint such that each overlay can be used only once. Given a subset of overlays , the total number of times an element is covered by is the sum of for each overlay in :
We define the -times coverage function as the sum of weights of elements in that are covered at least times by . Figure 1 shows an example computation of the -times coverage function.
The objective of the Maximum -times Coverage problem is to select overlays such that is maximized. This can be formulated as the maximization of the monotone set function under cardinality constraint ,
We define Min-cost -times Coverage as the minimum set of overlays such that the sum of the weights of elements covered at least times is . We assume provides sufficient -times coverage for . We define the -times Set Cover problem as the special case of Min-cost -times Coverage where .
The -times coverage function is not submodular.
We show is not submodular for (a similar counter example can be found for any ). Consider the example overlays in Table 1. When , none of ,, or provides -times coverage while covers two times and covers two times. Therefore, the marginal gain of adding , , or into an empty set is always zero, whereas adding into or achieves non-zero gains and adding into achieves even higher gain:
Given that and , the function does not satisfy the diminishing return property (Definition 1) and thus is not submodular. ∎
Since the -times coverage problem is not submodular, we cannot take advantage of the proven near-optimal performance of the greedy algorithm for the Submodular Maximum Coverage problem. Thus we seek new solutions.
Vaccine population coverage maximization
Peptide vaccine design can be framed as an instance of the maximum -time coverage problem. Peptides must immunogenic to be effective in a vaccine, and we use an integrated model of peptide-HLA immunogenicity (Supplemental Data). We define a peptide-HLA hit as a (peptide, HLA allele) pair where the peptide is predicted to immunogenic and displayed the HLA allele. Once we have determined a candidate set of peptides that are predicted to be immunogenic we then need to select a minimal subset of peptides such that each individual in of the population is predicted to have peptide-HLA hits.
We first introduce EvalVax-Robust
, an evaluation tool for estimating the population coverage of a proposed peptide vaccine set.EvalVax-Robust evaluates the percentage of the population having at least peptide-HLA binding hits in each individual. It takes into consideration allelic zygosity, and utilizes HLA haplotype frequencies for MHC class I (HLA-A/B/C) and MHC class II (HLA-DP/DQ/DR) genes to account for linkage disequilibrium (LD) among loci.
The MHC class I genes are encoded by the HLA-A, HLA-B, and HLA-C loci and a single chromosome contains one allele at each locus that is described by , , and respectively. The haplotype frequency of is denoted by and . Each individual inherits two haplotypes to form their diploid genotype. The the frequency of a diploid genotype is thus:
For a given peptide and an HLA allele , our machine learning model outputs a binary hit prediction indicating peptide-HLA immunogenicity. For each diploid genotype we compute the total number of peptide-HLA hits as the sum of over the unique HLA alleles in the genotype. There can be 3-6 unique alleles depending on the zygosity of each locus:
The total number of peptide-HLA hits provided by a set of peptides is the sum of number of hits per peptide:
We define the EvalVax-Robust predicted population coverage with peptide-HLA hits for a peptide vaccine set as:
EvalVax-Robust optimization can be accomplished by a solution to maximum -times coverage. corresponds to the possible genotypes, and the weights are population genotype frequencies . The peptide candidate set is the set of possible overlays , where each peptide is an overlay which covers a genotype times.
The MarginalGreedy algorithm for maximum -times coverage
We observed that greedy solutions to the maximum -times coverage problem are problematic when the -times objective is directly approached. The greedy approach can fail if it is unable to find early overlays with sufficient -times coverage, and can fail later because of early bad overlay choices. Both problems result in subsequent overlay choices not providing sufficient marginal gain to avoid ties and the random selection of subsequent overlays. Thus we needed a solution that did not try to directly solve for -times coverage at the outset, but rather marginally approached it.
We propose MarginalGreedy, an efficient algorithm that optimizes -times coverage with a sequence of greedy optimization cycles where the -th cycle optimizes the coverage function . We establish coverage starting at and increase the criteria to . Thus early overlay selections are guided by less stringent cycle-specific coverage objectives. A set of coverage cutoffs is used as the termination condition for each greedy optimization cycle, and when not specified, we assume by default. We use beam search to keep track of top candidate solutions at each iteration.
The full algorithm is given in Algorithm 1. A similar algorithm can be used to solve the -times Set Cover problem in which . For the Maximum -times Coverage problem with cardinality constraint , the optimization terminates when . When beam search is not used to reduce computation time (), we have extended the algorithm to break ties during the -times coverage iteration by looking ahead to -times coverage. We call this extension look ahead tie-breaking.
Another advantage of MarginalGreedy is that it is capable of guaranteeing high coverage for by controlling the cycle-specific coverage cutoffs . This is desired in vaccine design where wider population coverage is also important and we want to make sure that almost 100% of the population will be covered at least once.
MarginalGreedy outperforms a greedy baseline on toy examples
We empirically evaluate our MarginalGreedy algorithm with two toy examples: a LARGE dataset where a set of 30 overlays are randomly generated to cover a set of 10 elements (with equal weights) 0, 1, or 2 times, and a SMALL dataset where a set of 10 overlays that were randomly generated to cover a set of 5 elements with equal weights 0, 1, or 2 times. Figure 2 shows the efficiency of the MarginalGreedy algorithm on both the -times Set Cover and Maximum -times Coverage problems for varying values of . We compare our MarginalGreedy algorithm to a greedy algorithm that directly optimizes -times coverage and find the greedy algorithm degenerates to sub-optimal solutions for large values of . These sub-optimal solutions can be as bad as random. We repeated problem generation and optimization with 6 different random seeds to provide confidence bounds and used beam size with look-ahead tie-breaking. As shown in Figure 2, MarginalGreedy outperforms greedy on all tasks and datasets. We observed that MarginalGreedy has a significant advantage over greedy in tests with larger and in regions of higher coverage. We also compared the performance of polynomial-time MarginalGreedy to an exponential-time exhaustive search procedure that discovers the optimal set of overlays on smaller problems. We found MarginalGreedy produces solutions close to optimal.
COVID-19 vaccine design using maximum -times coverage
We used our MarginalGreedy algorithm to design a peptide vaccine for COVID-19 using the EvalVax-Robust objective described in (9). We obtained the SARS-CoV-2 viral proteome from the first documented case from GISAID (Elbe and Buckland-Merrett, 2017) (sequence entry Wuhan/IPBCAMS-WH-01/2019). We used Nextstrain (Hadfield et al., 2018) to identify open reading frames (ORFs) and translate the sequence. We applied sliding windows of length 8–10 (MHC class I) and 13–25 (MHC class II) to identify candidate peptides for inclusion in our peptide vaccine, resulting in 29,403 candidate peptides for MHC class I and 125,593 candidate peptides for MHC class II. We exclude peptides that may be glycosylated Gupta et al. (2004); Wolfert and Boons (2013), cross known cleavage sites Wang et al. (2020); Consortium (2019), overlap with human self-peptides Consortium (2019), or have a mutation rate (as calculated over 4,690 geographically sampled viral genomes) Elbe and Buckland-Merrett (2017); Hadfield et al. (2018) (details in Supplement). We use observed haplotype frequencies of 230 different HLA-A, HLA-B, and HLA-C loci for class I computations and observed haplotype frequencies of 280 different HLA-DP, HLA-DQ, and HLA-DR for class II computations (Liu et al., 2020). This dataset contains independent haplotype frequency measurements for three populations self-reporting as having White, Black, or Asian ancestry. We score peptide-HLA immunogenicity based upon clinical data from convalescent COVID-19 patients Liu et al. (2021). For peptide-HLA combinations not observed in our clinical data we selected a machine learning model of immunogenicity that best predicted the peptide-HLA combinations we did observe. Our selected MHC class I model predicts a peptide-HLA combination will be immunogenic if they are predicted to bind with an affinity of 50nM by the mean of the predictions from PUFFIN (Zeng and Gifford, 2019), NetMHCpan-4.0 (Jurtz et al., 2017) and MHCflurry 2.0 (O’Donnell et al., 2020). Our selected MHC class II model predicts a peptide-HLA combination will be immunogenic if they are predicted to bind with an affinity of 50 nM by NetMHCIIpan-4.0.
We implemented the OptiVax-Robust method for peptide vaccine design using the MarginalGreedy algorithm, but modified MarginalGreedy to reduce vaccine redundancy by eliminating unselected peptides that are within three (MHC class I) or five (MHC class II) edits on a sequence distance metric from the selected peptides at each iteration. We used a beam size of 10 for MHC class I and 5 for MHC class II.
Contemporary peptide vaccine design methods use machine learning scoring of peptide display for specific HLA alleles followed by the manual selection of peptide sets or a greedy search that maximizes coverage at =1. These methods do not utilize the distribution of HLA haplotypes in a population and thus can not accurately assess the population coverage provided by a vaccine. For a selected set of peptides, the IEDB Population Coverage Tool Bui et al. (2006) estimates peptide-MHC binding coverage and the distribution of peptides displayed for a given population but assumes independence between different loci and thus does not consider linkage disequilibrium. However, the IEDB tool does not provide automated peptide selection as we describe.
We compared our vaccine designs with 29 vaccine designs in the literature on the probability that a vaccine has at leastpeptide-HLA hits per individual in a population, and the expected number of per individual peptide-HLA hits in the population, which provides insight on how well a vaccine is displayed on average (Lee and Koohy, 2020; Fast et al., 2020; Poran et al., 2020; Bhattacharya et al., 2020; Baruah and Bose, 2020; Abdelmageed et al., 2020; Ahmed et al., 2020; Srivastava et al., 2020; Herst et al., 2020; Vashi et al., 2020; Akhand et al., 2020; Mitra et al., 2020; Khan et al., 2020; Banerjee et al., 2020; Ramaiah and Arumugaswami, 2020; Gupta et al., 2020; Saha and Prasad, 2020; Tahir ul Qamar et al., 2020; Singh et al., 2020). Figure 3
shows the comparison between OptiVax-Robust (A) MHC class I and (B) class II vaccine designs at all vaccine sizes (top solution in the beam up to the given vaccine size) from 1-45 peptides (blue curves) and baseline vaccines (red crosses) from prior work. We observe superior performance of OptiVax-Robust vaccine designs on all evaluation metrics at all vaccine sizes for both MHC class I and class II. Most baselines achieve reasonable coverage atpeptide hits. However, many fail to show a high probability of higher hit counts, indicating a lack of predicted redundancy if a single peptide is not displayed. Note that OptiVax-Robust also outperforms all baselines on coverage and achieve 99.99% (MHC class I) and 99.67% (MHC class II) coverage for , suggesting its capability to cover almost full population at least once while optimizing for higher peptide display diversity. We also found that OptiVax-Robust outperforms all baselines with an immunogenicity model that does not incorporate clinical data (Supplement Figure S1).
MHC class I results.
We selected an optimized set of peptides from all SARS-CoV-2 proteins using OptiVax. We use our MHC class I integrated model of peptide-HLA immunogenicity for our objective function (Supplemental). After all of our filtering steps we considered 1546 candidate peptides. With OptiVax-Robust optimization, we design a vaccine with 18 peptides that achieves 99.999% EvalVax-Robust coverage over populations self-reporting as having Asian, Black, or White ancestry with at least one peptide-HLA hit per individual. This set of peptides also provides 99.53% coverage with at least 5 peptide-HLA hits and 92.32% coverage with at least 8 peptide-HLA hits (Figure 4A, Table S1). The population level distribution of the number of peptide-HLA hits in White, Black, and Asian populations is shown in Figure 4A, where the expected number of peptide-HLA hits 12.532, 11.425 and 13.326, respectively.
MHC class II results.
We use our MHC class II model of peptide-HLA immunogenicity for our objective function (Supplemental). After all of our filtering steps we had 8003 candidate peptides. With OptiVax-Robust optimization, we design a vaccine with 18 peptides that achieves 99.67% EvalVax-Robust coverage with at least one peptide-HLA hit per individual. This set of peptides also provides 97.22% coverage with at least 5 peptide-HLA hits and 87.99% coverage with at least 8 peptide-HLA hits (Figure 4B, Table S1). The population level distribution of the number of peptide-HLA hits per individual in populations self-reporting as having Asian, Black, or White ancestry is shown in Figure 4B, where the expected number of peptide-HLA hits is 17.206, 16.085 and 11.210, respectively.
We introduced the maximum -times coverage problem, and showed that it is not submodular. We presented a beam search algorithm for solving the problem, and used it to produce a peptide vaccine design for COVID-19. We compared the resulting optimized peptide vaccine design with 29 other published designs and found that the optimized design provides substantially better population coverage for both MHC class I and class II presentation of viral peptides. The use of -times coverage as an objective increases vaccine redundancy and thus the probability that one of the presented peptides will be immunogenic and produce a T cell response. Our methods can also be used to augment existing vaccine designs with peptides to improve their predicted population coverage by initializing the algorithm with peptides that are present in an existing design.
This work presents a principled combinatorial design approach for developing peptide vaccines using machine learning methods, and we apply it to develop a novel MHC class I and II vaccine design for COVID-19. On our metrics, the resulting design is predicted to provide better population coverage than 29 existing vaccine designs with less than 150 peptides. More broadly, the EvalVax and OptiVax methods we introduce may be used to evaluate and design peptide vaccines against viral infections or cancer to achieve greater population coverage and robust peptide display per individual over existing methods. The general formulation of the maximum -times coverage problem may find use in other fields, such as economics. One potential consequence of this work is identifying vaccines that do not provide adequate population coverage that could lead to costly failures of vaccine trials. We aim to mitigate this risk via our principled vaccine formulation objective and conservative approach for defining the candidate peptide set.
This work was supported in part by Schmidt Futures, a Google Cloud Platform grant, NIH grant R01CA218094, and a C3.AI DTI Award to D.K.G. Ge Liu’s contribution was made prior to joining Amazon.
G.L., B.C., and D.G. contributed to problem definition and solution. G.L. designed and implemented the optimization procedure, with advice from B.C. and D.G. G.L., B.C., and D.G. wrote the paper.
Declaration of Interests
Brandon Carter is an employee of Think Therapeutics, and Ge Liu is an employee of Amazon. David Gifford and Brandon Carter have a financial interest in Think Therapeutics, Inc.
- Design of multi epitope-based peptide vaccine against E protein of human 2019-nCoV: an immunoinformatics approach. bioRxiv. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- Preliminary identification of potential vaccine targets for the COVID-19 coronavirus (SARS-CoV-2) based on SARS-CoV immunological studies. Viruses 12 (3), pp. 254. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- Genome based evolutionary study of SARS-CoV-2 towards the prediction of epitope based chimeric vaccine. bioRxiv. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- T cell-inducing vaccine durably prevents mucosal SHIV infection even with lower neutralizing antibody titers. Nature Medicine. Cited by: Introduction.
- Energetics based epitope screening in SARS CoV-2 (COVID 19) spike glycoprotein by immuno-informatic analysis aiming to a suitable vaccine development. bioRxiv. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- Immunoinformatics-aided identification of T cell and B cell epitopes in the surface glycoprotein of 2019-nCoV. Journal of Medical Virology. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- Development of epitope-based peptide vaccine against novel coronavirus 2019 (SARS-COV-2): immunoinformatics approach. Journal of medical virology 92 (6), pp. 618–631. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- Robust maximization of non-submodular objectives. In AISTATS, Cited by: Introduction.
- Predicting population coverage of T-cell epitope-based diagnostics and vaccines. BMC Bioinformatics 7 (1), pp. 153. Cited by: COVID-19 vaccine design using maximum -times coverage.
- Predominant naturally processed peptides bound to HLA-DR1 are derived from MHC-related molecules and are heterogeneous in size. Nature 358 (6389), pp. 764–768. Cited by: Introduction.
- Covering problems with hard capacities. SIAM Journal on Computing 36 (2), pp. 498–515. Cited by: Multi-set Multi-cover problem.
- UniProt: a worldwide hub of protein knowledge. Nucleic acids research 47 (D1), pp. D506–D515. Cited by: COVID-19 vaccine design using maximum -times coverage, Removal of cleavage regions., Self-epitope removal..
- The spike glycoprotein of the new coronavirus 2019-nCoV contains a furin-like cleavage site absent in CoV of the same clade. Antiviral research 176, pp. 104742. Cited by: Removal of cleavage regions..
Worst-case analysis of greedy heuristics for integer programming with nonnegative data. Mathematics of Operations Research 7 (4), pp. 515–531. Cited by: Introduction, Multi-set Multi-cover problem.
- Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Challenges 1 (1), pp. 33–46. Cited by: COVID-19 vaccine design using maximum -times coverage, Removal of mutable peptides..
- Potential t-cell and b-cell epitopes of 2019-ncov. bioRxiv. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- A threshold of ln n for approximating set cover. Journal of the ACM (JACM) 45 (4), pp. 634–652. Cited by: Set Cover and Maximum Coverage.
- Targets of T cell responses to SARS-CoV-2 coronavirus in humans with COVID-19 disease and unexposed individuals. Cell. External Links: Cited by: Introduction, Detailed OptiVax-Robust Vaccine Design Results.
- A sequence homology and bioinformatic approach can predict candidate targets for immune responses to SARS-CoV-2. Cell Host & Microbe 27 (4), pp. 671–680. Cited by: Detailed OptiVax-Robust Vaccine Design Results.
- Identification of potential vaccine candidates against SARS-CoV-2, a step forward to fight novel coronavirus 2019-nCoV: a reverse vaccinology approach. bioRxiv. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- Prediction of N-glycosylation sites in human proteins. In preparation. External Links: Cited by: COVID-19 vaccine design using maximum -times coverage, Removal of glycosylated peptides..
- Nextstrain: real-time tracking of pathogen evolution. Bioinformatics 34 (23), pp. 4121–4123. Cited by: COVID-19 vaccine design using maximum -times coverage, Removal of mutable peptides..
- An effective CTL peptide vaccine for ebola zaire based on survivors’ CD8+ targeting of a particular nucleocapsid protein epitope with potential implications for COVID-19 vaccine design. Vaccine. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- Towards personalized, tumour-specific, therapeutic vaccines for cancer. Nature Reviews Immunology 18 (3), pp. 168. Cited by: Introduction.
- A unified framework of robust submodular optimization. arXiv preprint arXiv:1906.06393. Cited by: Introduction.
- NetMHCpan-4.0: improved peptide–MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data. The Journal of Immunology 199 (9), pp. 3360–3368. Cited by: COVID-19 vaccine design using maximum -times coverage.
- Vaccination against HPV-16 oncoproteins for vulvar intraepithelial neoplasia. New England Journal of Medicine 361 (19), pp. 1838–1847. Cited by: Introduction.
- Design of an epitope-based peptide vaccine against the Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2): a vaccine informatics approach. bioRxiv. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- Multiplex identification of antigen-specific T cell receptors using a combination of immune assays and immune receptor sequencing. PLoS One 10 (10), pp. e0141561. Cited by: Introduction.
- Tight approximation results for general covering integer programs. In Proceedings 42nd IEEE Symposium on Foundations of Computer Science, pp. 522–528. Cited by: Introduction, Multi-set Multi-cover problem.
- Approximating covering integer programs with multiplicity constraints. Discrete applied mathematics 129 (2-3), pp. 461–473. Cited by: Introduction, Multi-set Multi-cover problem.
- In silico identification of vaccine targets for 2019-nCoV. F1000Research 9. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- Peptide vaccine: progress and challenges. Vaccines 2 (3), pp. 515–536. Cited by: Introduction.
- Computationally optimized SARS-CoV-2 MHC class I and II vaccine formulations predicted to target human haplotype distributions. Cell Systems 11 (2), pp. 131–144. Cited by: COVID-19 vaccine design using maximum -times coverage.
- Predicted cellular immunity population coverage gaps for SARS-CoV-2 subunit vaccines and their augmentation by compact peptide sets. Cell Systems 12 (1), pp. 102–107. Cited by: Introduction, COVID-19 vaccine design using maximum -times coverage.
- Deletion-robust submodular maximization: data summarization with the right to be forgotten. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2449–2458. Cited by: Introduction.
- Multi-epitope based peptide vaccine design against SARS-CoV-2 using its spike protein. bioRxiv. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- An analysis of approximations for maximizing submodular set functions-I. Mathematical programming 14 (1), pp. 265–294. Cited by: Set Cover and Maximum Coverage.
- A large-scale database of T-cell receptor beta (TCR) sequences and binding associations from natural and synthetic exposure to SARS-CoV-2. Research Square. Cited by: Introduction.
- MHCflurry 2.0: improved pan-allele prediction of MHC class I-presented peptides by incorporating antigen processing. Cell Systems 11 (1), pp. 42–48. Cited by: COVID-19 vaccine design using maximum -times coverage.
- An immunogenic personal neoantigen vaccine for patients with melanoma. Nature 547 (7662), pp. 217–221. Cited by: Introduction.
- Sequence-based prediction of vaccine targets for inducing T cell responses to SARS-CoV-2 utilizing the bioinformatics predictor RECON. bioRxiv. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- Insights into cross-species evolution of novel human coronavirus 2019-nCoV and defining immune determinants for vaccine development. bioRxiv. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- HLA peptide length preferences control CD8+ T cell responses. The Journal of Immunology 191 (2), pp. 561–571. Cited by: Introduction.
- T cell responses to viral infections–opportunities for peptide vaccination. Frontiers in immunology 5, pp. 171. Cited by: Introduction.
- In silico approach for designing of a multi-epitope based vaccine against novel Coronavirus (SARS-COV-2). bioRxiv. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- Next-generation sequencing of T and B cell receptor repertoires from COVID-19 patients showed signatures associated with severity of disease. Immunity 53 (2), pp. 442–455. Cited by: Introduction.
- Robust T cell immunity in convalescent individuals with asymptomatic or mild COVID-19. Cell 183, pp. 158–168. Cited by: Introduction.
- Designing a multi-epitope peptide-based vaccine against SARS-CoV-2. bioRxiv. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- Magnitude and dynamics of the T-Cell response to SARS-CoV-2 infection at both individual and population levels. medRxiv. Cited by: Introduction.
- Improved approximation guarantees for packing and covering integer programs. SIAM Journal on Computing 29 (2), pp. 648–670. Cited by: Introduction, Multi-set Multi-cover problem.
- Structural basis to design multi-epitope vaccines against Novel Coronavirus 19 (COVID19) infection, the ongoing pandemic emergency: an in silico approach. bioRxiv. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- Designing of a next generation multiepitope based vaccine (MEV) against SARS-COV-2: immunoinformatics and in silico approaches. bioRxiv. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- Understanding the B and T cells epitopes of spike protein of severe respiratory syndrome coronavirus-2: a computational way to predict the immunogens. bioRxiv. Cited by: COVID-19 vaccine design using maximum -times coverage, Table S1, Detailed OptiVax-Robust Vaccine Design Results.
- A unique protease cleavage site predicted in the spike protein of the novel pneumonia coronavirus (2019-ncov) potentially related to viral transmissibility. Virologica Sinica, pp. 1–3. Cited by: COVID-19 vaccine design using maximum -times coverage, Removal of cleavage regions..
- Adaptive immune activation: glycosylation does matter. Nature Chemical Biology 9 (12), pp. 776–784. External Links: Cited by: COVID-19 vaccine design using maximum -times coverage, Removal of glycosylated peptides..
- Quantification of uncertainty in peptide-MHC binding prediction improves high-affinity peptide selection for therapeutic design. Cell Systems 9 (2), pp. 159–166. Cited by: COVID-19 vaccine design using maximum -times coverage.
Details of Candidate Peptide Filtering
For our SARS-CoV-2 peptide vaccine design, we eliminate peptides that are expected to mutate and thus cause vaccine escape, peptides crossing cleavage sites, peptides that may be glycosylated, and peptides that are identical to peptides in the human proteome.
Removal of mutable peptides.
We eliminate peptides that are observed to mutate above an input threshold rate (0.001) to improve coverage over all SARS-CoV-2 variants and reduce the chance that the virus will mutate and escape vaccine-induced immunity in the future. When possible, we select peptides that are observed to be perfectly conserved across all observed SARS-CoV-2 viral genomes. Peptides that are observed to be perfectly conserved in thousands of examples may be functionally constrained to evolve slowly or not at all.
For SARS-CoV-2, we obtained the most up to date version of the GISAID database (Elbe and Buckland-Merrett, 2017) (as of 2:02pm EST May 13, 2020, acknowledgements in Supplementary table) and used Nextstrain (Hadfield et al., 2018)111from GitHub commit 639c63f25e0bf30c900f8d3d937de4063d96f791 to remove genomes with sequencing errors, translate the genome into proteins, and perform multiple sequence alignments (MSAs). We retrieved 24468 sequences from GISAID, and 19288 remained after Nextstrain quality processing. After quality processing, Nextstrain randomly sampled 34 genomes from every geographic region and month to produce a representative set of 5142 genomes for evolutionary analysis. Nextstrain definition of a “region” can vary from a city (e.g., “Shanghai”) to a larger geographical district. Spatial and temporal sampling in Nextstrain is designed to provide a representative sampling of sequences around the world.
The 5142 genomes sampled by Nextstrain were then translated into protein sequences and aligned. We eliminated viral genome sequences that had a stop codon, a gap, an unknown amino acid (because of an uncalled nucleotide in the codon), or had a gene that lacked a starting methionine, except for ORF1b which does not begin with a methionine. This left a total of 4690 sequences that were used to compute peptide level mutation probabilities. For each peptide, the probability of mutation was computed as the number of non-reference peptide sequences observed divided by the total number of peptide sequences observed.
Removal of cleavage regions.
SARS-CoV-2 contains a number of post-translation cleavage sites in ORF1a and ORF1b that result in a number of nonstructural protein products. Cleavage sites for ORF1a and ORF1b were obtained from UniProt (Consortium, 2019) under entry P0DTD1. In addition, a furin-like cleavage site has been identified in the Spike protein (Wang et al., 2020; Coutard et al., 2020). This cleavage occurs before peptides are loaded in the endoplasmic reticulum for class I or endosomes for class II. Any peptide that spans any of these cleavage sites is removed from consideration. This removes 3,739 peptides out of the 154,996 we consider across windows 8-10 (class I) and 13-25 (class II) (2.4%).
Removal of glycosylated peptides.
Glycosylation is a post-translational modification that involves the covalent attachment of carbohydrates to specific motifs on the surface of the protein. We eliminate all peptides that are predicted to have N-linked glycosylation as it can inhibit MHC class I peptide loading and T cell recognition of peptides (Wolfert and Boons, 2013). We identified peptides that may be glycosylated with the NetNGlyc N-glycosylation prediction server (Gupta et al., 2004) and eliminated peptides where NetNGlyc predicted a non-zero N-glycosylation probability in any residue. This resulted in the elimination of 18,957 of the 154,996 peptides considered (12%).
T cells are selected to ignore peptides derived from the normal human proteome, and thus we remove any self peptides from consideration for a vaccine. In addition, it is possible that a vaccine might stimulate the adaptive immune system to react to a self peptide that was presented at an abnormally high level, which could lead to an autoimmune disorder. All peptides from SARS-CoV-2 were scanned against the entire human proteome downloaded from UniProt (Consortium, 2019) under Proteome ID UP000005640. A total of 48 exact peptide matches (46 8-mers, two 9-mers) were discovered and eliminated from consideration.
Detailed OptiVax-Robust Vaccine Design Results
Table S1 shows the evaluation of our OptiVax-Robust vaccine designs using the MarginalGreedy algorithm compared to 29 designs by others as baselines (Lee and Koohy, 2020; Fast et al., 2020; Poran et al., 2020; Bhattacharya et al., 2020; Baruah and Bose, 2020; Abdelmageed et al., 2020; Ahmed et al., 2020; Srivastava et al., 2020; Herst et al., 2020; Vashi et al., 2020; Akhand et al., 2020; Mitra et al., 2020; Khan et al., 2020; Banerjee et al., 2020; Ramaiah and Arumugaswami, 2020; Gupta et al., 2020; Saha and Prasad, 2020; Tahir ul Qamar et al., 2020; Singh et al., 2020). Additionally, we compare to baseline vaccines containing the full S and S1 proteins (Grifoni et al., 2020b), OptiVax-Robust designed vaccines with peptides stemming from the S, M, and N proteins only (Grifoni et al., 2020a), and random subsets of peptides that are binders (averaged over 200 runs, each consisting of the same number of peptides as the OptiVax-Robust design).
|Peptide Set||Vaccine Size||EvalVax-Robust||EvalVax-Robust||EvalVax-Robust||Exp. # peptide-HLA hits / vaccine size||Exp. # peptide-HLA hits (White)||Exp. # peptide-HLA hits (Black)||Exp. # peptide-HLA hits (Asian)|
|MHC Class I Peptide Vaccine Evaluation|
|Srivastava et al. (2020)||%||%||%|
|Herst et al. (2020)||%||%||%|
|Random subset of binders||%||%||%|
|Herst et al. (2020)-top16||%||%||%|
|Baruah and Bose (2020)||%||%||%|
|Fast et al. (2020)||%||%||%|
|Vashi et al. (2020)||%||%||%|
|Abdelmageed et al. (2020)||%||%||%|
|Lee and Koohy (2020)||%||%||%|
|Ahmed et al. (2020)||%||%||%|
|Poran et al. (2020)||%||%||%|
|Singh et al. (2020)||%||%||%|
|Akhand et al. (2020)||%||%||%|
|Bhattacharya et al. (2020)||%||%||%|
|Khan et al. (2020)||%||%||%|
|Saha and Prasad (2020)||%||%||%|
|Gupta et al. (2020)||%||%||%|
|Mitra et al. (2020)||%||%||%|
|MHC Class II Peptide Vaccine Evaluation|
|Ramaiah and Arumugaswami (2020)||%||%||%|
|Random subset of binders||%||%||%|
|Fast et al. (2020)||%||%||%|
|Banerjee et al. (2020)||%||%||%|
|Tahir ul Qamar et al. (2020)||%||%||%|
|Poran et al. (2020)||%||%||%|
|Akhand et al. (2020)||%||%||%|
|Singh et al. (2020)||%||%||%|
|Ahmed et al. (2020)||%||%||%|
|Mitra et al. (2020)||%||%||%|
|Vashi et al. (2020)||%||%||%|
|Abdelmageed et al. (2020)||%||%||%|
|Baruah and Bose (2020)||%||%||%|
Evaluation of OptiVax-Robust Vaccine Design and baselines with non-experimentally-calibrated machine learning predictions
We found that MarginalGreedy produced vaccine designs superior to the baselines we tested even when it used immunogenicity models that did not rely upon clinical data of peptide immunogenicity. Since our immunogenicity model used experimental data that was not accessible to the baseline methods, our designs have an advantage over baseline vaccines that did not use calibrated machine learning predictions. We repeated vaccine design using an ensemble of NetMHCpan-4.0 and MHCflurry with 50 nM predicted affinity cutoff for predicting MHC class I immunogenicity and NetMHCIIpan-4.0 for MHC class II. As shown in Figure S1, OptiVax-Robust again shows superior performance on all metrics at all vaccine size under 35, indicating the success of MarginalGreedy in optimizing population coverage with diverse peptide display.