SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study

Self-supervised pre-training methods have brought remarkable breakthroughs in the understanding of text, image, and speech. Recent developments in genomics have also adopted these pre-training methods for genome understanding. However, they focus only on understanding haploid sequences, which hinders their applicability to understanding genetic variations, also known as single nucleotide polymorphisms (SNPs), which are crucial for genome-wide association study. In this paper, we introduce SNP2Vec, a scalable self-supervised pre-training approach for understanding SNPs. We apply SNP2Vec to perform long-sequence genomics modeling, and we evaluate the effectiveness of our approach on predicting Alzheimer's disease risk in a Chinese cohort. Our approach significantly outperforms existing polygenic risk score methods and all other baselines, including the model that is trained entirely with haploid sequences. We release our code and dataset at https://github.com/HLTCHKUST/snp2vec.


1 Introduction

Self-supervised pre-training has become an indispensable step for almost all natural language processing (NLP) tasks (devlin2019bert; liu2019roberta; yang2019xlnet). Pre-trained language models, thanks to the usage of massive text corpora, are effective in handling data scarcity and generalizing to unseen examples (brown2020gpt3; cahyawijaya2021indonlg; wilie2020indonlu; yu2021adaptsum; liu2021crossner; winata2021-multilingual). Inspired by the success of pre-trained language models, pre-trained genomic models have been proposed to cope with genomic sequence prediction tasks (zaheer2020big; ji2021dnabert). However, these models only focus on modeling the four nucleobases (i.e., A, T, C, and G), while ignoring genomic variations in the pre-training stage. Although they are effective in haploid pattern analysis, such as promoter region and chromatin-profile prediction, they fail to tackle more complex and challenging tasks, such as genome-wide association study (GWAS) (wtccc2007gwas; corvin2010gwas; bush2012gwas), which require an in-depth understanding of long genomic sequences and the genomic variation between a homologous chromosome pair.

To address these shortcomings, we introduce a self-supervised pre-training approach called SNP2Vec, which leverages the single-nucleotide polymorphism (SNP, pronounced ‘snip’) information gathered from a large-scale SNP database to inject genomic variations in the pre-training stage. SNP2Vec enables the model to learn the semantics of a diploid sequence (genotype) pattern in a diploid cell. We apply SNP2Vec to a linear-attention model, Linformer (wang2020linformer), to allow the model to encode long genomic sequences for Alzheimer’s disease risk prediction in a Chinese cohort. We compare SNP2Vec with non-pretrained models, as well as an existing strong baseline polygenic risk scoring (PRS) model, to demonstrate the effectiveness of our approach.

Our contributions are summarized as follows:

  • We are the first to introduce a scalable self-supervised pre-training approach (SNP2Vec) to learn genomic variations, which are essential for genome-wide association study.

  • We demonstrate a method for modeling long diploid sequences with a length of >20,000 base pairs (bps) using an attention-based model within a single forward pass.

  • We demonstrate the effectiveness of SNP2Vec, which significantly outperforms all the baselines, including a widely-used polygenic risk scoring (PRS) method, by 5-7% accuracy and AUROC for the Alzheimer’s disease prediction task in a Chinese elderly cohort.

  • We conduct comprehensive analyses to show the effectiveness of SNP encoding and Byte Pair Encoding (BPE) tokenization compared to the other commonly used methods for genomics modeling.

2 Related Works

2.1 Genome-Wide Association Study

To this day, predicting the risk of hereditary diseases from a given genotype is done through genome-wide association study (GWAS) by applying a polygenic risk score (PRS). PRS utilizes GWAS data to identify important single nucleotide variations (SNVs) over a certain range from the gene of interest. The SNVs are first filtered according to a statistical measure to reduce the bias towards a certain population, and the filtered SNVs are then used to build a classifier, which can be applied to a new genotype to determine the likelihood of developing the disease. This method has been applied in many works and has provided valuable insights for researchers into diseases including heart attack, diabetes, and different types of cancer (lello2019genomic). Moreover, the PRS model has also been used in research and clinical practice for Alzheimer's disease (zhou2021polygenic). Nevertheless, all these methods fail to incorporate the patterns of the genomic sequence that determine the actual function. This is likely to lead the model towards non-representative bias, especially when the experimental data is small.

2.2 Statistical Modeling for Genomics

Tokenization in Genomics

k-mer (synonymous to n-gram) tokenization is the most commonly used tokenization method in existing genome modeling works (min2017kmeremb; shen2018recurrent). Gapped k-mer tokenization (ghandi2014enhanced; shrikumar2019gkmexplain) is a more efficient variant of k-mer tokenization that introduces a gap parameter g, which constitutes the stride between each k-mer window. However, the gapped k-mer approach leads to the loss of some information when g is larger than k. In recent years, subword tokenization approaches (sennrich2015neural; kudo2018sentencepiece) have also been explored in genomics (zaheer2020big).
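To make the stride distinction concrete, here is a minimal sketch of plain k-mer and gapped (strided) k-mer tokenization as described above; the function names are ours and are not taken from the cited works.

```python
def kmer_tokenize(seq: str, k: int) -> list[str]:
    """Plain k-mer tokenization: a window of size k slides one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def gapped_kmer_tokenize(seq: str, k: int, g: int) -> list[str]:
    """Gapped k-mer tokenization: the k-sized window advances with stride g,
    so some bases are skipped whenever g > k (information loss)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, g)]

seq = "ATCGGATTACA"
print(kmer_tokenize(seq, 3))            # ['ATC', 'TCG', 'CGG', ...]
print(gapped_kmer_tokenize(seq, 3, 5))  # ['ATC', 'ATT'] -- bases at positions 3-4 and 8-10 are never covered
```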

Machine Learning in Genomics

The support vector machine (SVM) is a traditional machine learning approach used to quickly and accurately interpret the nonlinear gapped k-mer (shrikumar2019gkmexplain). hill2018deep leverage a deep recurrent neural network (RNN) to discover complex biological rules for deciphering RNA protein-coding potential. zhuang2019simple incorporate a convolutional neural network (CNN) to predict enhancer–promoter interactions from DNA sequence data. shen2018recurrent introduce an RNN to predict transcription factor binding sites; they treat each k-mer as a word and pre-train a word representation model through the word2vec algorithm (mikolov2013efficient). zaheer2020big propose BigBird, pre-train it on the human reference genome, and improve the performance on downstream tasks.

2.3 Self-Supervised Pre-training

Recently, using self-supervised pre-training models on large-scale unlabeled data and then fine-tuning them on a small amount of labeled data has become the norm in machine learning. BERT (devlin2019bert) is a deep bidirectional transformer pre-trained on BooksCorpus (zhu2015aligning) (800M words) and English Wikipedia (2,500M words) for language understanding. liu2019roberta introduce RoBERTa, which has a similar architecture to BERT but is trained on a much larger corpus (160GB of text) and consequently achieves better performance. In recent years, pre-training generative models (radford2019language; raffel2019exploring; lewis2019bart) has significantly improved the performance of various language generation tasks such as machine translation, question answering, and conversational AI.

Self-supervised learning approaches have also been adopted in genomics (zaheer2020big; ji2021dnabert) and proteomics (madani2020progen; elnaggar2020prottrans). These methods pre-train models using large-scale unlabelled datasets such as the human reference genome from the Genome Reference Consortium (GRC) (church2011hg19; schneider2017refgen38) and protein sequence databases such as SWISS-PROT and TrEMBL (boeckmann2003swissprot). In this paper, we focus on genomics and use the human reference genome for pre-training. Genomics data does not have the same structure as human languages: it has no known syntax or grammatical rules, and it consists of very long sequences with only a small number of differences between human subjects.

Figure 1: SNP2Vec Pre-training Pipeline. SNP2Vec merges information from the reference genome and the SNP database to form a chromosome matrix, which is then utilized to construct the SNP pre-training dataset following the SNP encoding’s token format. This pre-training dataset is employed to train a genome language model through the masked language modeling task.

3 SNP2Vec

Existing pre-training methods in genomics, such as BigBird zaheer2020big and DNABERT ji2021dnabert, are only optimized to understand the pattern of a haploid sequence (haplotype) based on the reference genome. This hinders the model from learning genomic variations, which is essential for understanding traits in humans. In contrast to prior works in genomics pre-training, we develop SNP2Vec to enable pre-training for encoding and understanding patterns of genomic variations in a diploid sequence. Figure 1 depicts the overall structure of the SNP2Vec pre-training method. We elaborate on our SNP2Vec method in 3 subsections: 1) SNP Encoding, i.e., how we encode a diploid sequence as a sequence of SNP tokens; 2) Self-Supervised SNP Dataset, i.e., how we construct a self-supervised dataset using the SNP token; 3) Self-supervised SNP Pre-training, i.e., how we perform self-supervised pre-training for learning the sequence pattern of SNP tokens.

3.1 Preliminaries

What are haploid and diploid sequences?

A diploid is a cell or organism that has paired chromosomes, one from each parent (https://www.genome.gov/genetics-glossary/Diploid). Human cells are mostly diploid, except for the sex cells. In this sense, a diploid sequence (genotype) refers to a pair of homologous sequences (alleles) inside the diploid chromosome, while a haploid sequence (haplotype) refers to the DNA sequence from a specific allele of the diploid sequence. The haploid sequence is suitable for understanding the regulatory function of a DNA pattern (zhou2015dsea; ouyang2008haplotype), such as determining a binding site for a certain type of protein, as it provides the representation of the actual nucleotides. A diploid sequence, on the other hand, is more suitable for understanding the phenotype over a population (levy2007diploid; wang2008diploid), since it allows understanding of the genomic variations between two homologous DNA sequences, which conveys the dosage and gene expression level of a variation. These genomic variations are gathered by comparing the sequence to a genome reference sequence, and they can be categorized based on their dosage, i.e., wild-type (normal), heterozygous, or homozygous, and based on their differences, i.e., substitution, insertion, and deletion. The depiction of haploids and diploids along with their variations is shown in Figure 2.

Figure 2: Diploid sequence variations. The box on the top-left shows the wild-type sequence, while others are its variations. H1 and H2 denote the haploid sequence for each parent allele. D represents the diploid sequence of the two alleles.

How do we get the haploid and diploid sequence?

As most human cells are predominantly diploid, performing genome sequencing on such a homologous chromosome pair produces a diploid sequence rather than a haploid sequence, because the primer binds to both of the homologous regions from each chromosome (ye2012primer). Extracting haploid sequences from a diploid sequence requires an additional step through an estimation process called phasing (stephens2001phasing). Despite their effectiveness, the quality of phasing methods (browning2007beagle; browning2009beagle; patterson2014whatshap) is not perfect and tends to decrease significantly, especially when the gap between the SNPs is large (choi2018phasing).

3.2 SNP Encoding

We first extend the nucleotide tokens from 5 token types, ‘A’, ‘T’, ‘C’, ‘G’, and ‘N’, into 11 tokens by adding 6 insertion-deletion (indel) tokens, ‘AI’, ‘TI’, ‘CI’, ‘GI’, ‘NI’, and ‘DEL’, where an ‘XI’ token represents any insertion after the nucleotide ‘X’, and ‘DEL’ represents a nucleotide deletion. There can be many different possibilities for an insertion, e.g., an insertion after the nucleotide ‘T’ can produce “TG”, “TGGG”, or “TAAA”; therefore, we aggregate all insertions into a single token to reduce the sparsity of the indel representation, as indels occur relatively rarely compared to substitutions, at a ratio of around 1:5 (chen2009variation). To encode a diploid sequence, we construct all combinations with replacement of the 11 haploid and indel tokens taken 2 at a time, i.e., C(11+2-1, 2) = C(12, 2) = 66 combinations, producing a total of 66 types of SNP tokens consisting of wild-type, heterozygous, and homozygous variation tokens. The resulting SNP tokens are represented as ‘X1/X2’, where ‘X1’ and ‘X2’ denote the aligned nucleotide or indel tokens from the two alleles, ordered alphabetically. A depiction of the SNP tokens is shown in Figure 3. To reduce the vocabulary size and facilitate a more straightforward representation for downstream processes such as pre-processing, tokenization, and modeling, we map the SNP tokens into a single-character representation. The mapping of the SNP tokens into a single-character representation is shown in Appendix A.
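To illustrate the encoding, the sketch below enumerates the 66 SNP tokens via combinations with replacement and encodes aligned allele pairs; the variable and function names are ours, and the single-character mapping itself is deferred to Appendix A.

```python
from itertools import combinations_with_replacement

# 5 nucleotide tokens + 6 insertion-deletion (indel) tokens = 11 base tokens
BASE_TOKENS = ["A", "T", "C", "G", "N", "AI", "TI", "CI", "GI", "NI", "DEL"]

# All unordered pairs (combinations with replacement): C(12, 2) = 66 SNP tokens
SNP_TOKENS = ["{}/{}".format(*sorted(pair))
              for pair in combinations_with_replacement(BASE_TOKENS, 2)]
assert len(SNP_TOKENS) == 66

def encode_diploid(allele1: str, allele2: str) -> str:
    """Encode a pair of aligned nucleotide/indel tokens as one SNP token,
    ordering the two alleles alphabetically (e.g., ('G', 'A') -> 'A/G')."""
    x1, x2 = sorted([allele1, allele2])
    return f"{x1}/{x2}"

print(encode_diploid("A", "A"))    # wild-type or homozygous:    'A/A'
print(encode_diploid("G", "A"))    # heterozygous substitution:  'A/G'
print(encode_diploid("DEL", "T"))  # heterozygous deletion:      'DEL/T'
```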

By incorporating the SNP encoding, variant calling information gathered from the DNA sequencing machine can be directly converted into a sequence of SNP tokens, which are then used for model fine-tuning and inference. However, this is not directly applicable for self-supervised pre-training, since DNA sequencing data is hard to obtain, and it is unethical to share it publicly as it contains very sensitive and personal information about the human subjects. In the next section, we discuss in detail how we construct an inexpensive and reliable pre-training dataset to perform self-supervised pre-training on the SNP tokens by utilizing publicly available genomics data sources.

3.3 Self-Supervised SNP Dataset

Prior self-supervised pre-training approaches in genomics (zaheer2020big; ji2021dnabert) only utilize the human reference genome (church2011hg19; schneider2017refgen38) as the unlabelled data for haploid genomics pre-training, which does not capture any genomic variations. We extend these haploid modeling techniques into a diploid modeling method, which allows the model to learn patterns of genomic variations, by generating unlabelled pre-training data for learning SNP tokens. More specifically, we use the genome sequence from the human reference genome and genome variations from a large-scale SNP database, namely dbSNP (smigielski2000dbsnp), to generate the pre-training data.

Figure 3: SNP tokens consist of a total of 66 types of token covering all possible variations in a diploid sequence including wild-types, heterozygous variations, and homozygous variations.

Human reference genome

The human reference genome is a genome sequence derived from the DNA collected from a number of people (pollard2017chromosome); it was first released in 2000 and is periodically updated. The two most commonly used versions of the human reference genome are GRCh37 (church2011hg19, https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/) and GRCh38 (schneider2017refgen38, https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/). A human reference genome consists of the genome sequence information for all human chromosomes, with a total sequence length of around 3 billion bases. Most of the positions are mapped and represented as either ‘A’, ‘T’, ‘C’, or ‘G’, while the others are unmapped and flagged with the unknown (‘N’) token.

dbSNP

dbSNP (smigielski2000dbsnp, https://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi) is a central public repository of human SNPs. dbSNP covers a broad collection of simple genetic variations of up to 50 bps in length, which includes single-base nucleotide substitutions, small-scale multi-base deletions, and small-scale multi-base insertions. A single SNP record in dbSNP contains the following information: chromosome number, position in the chromosome, SNP identifier, reference sequence (REF), alternative sequence(s) (ALTS), the probabilities of the REF and ALTS, and other metadata. The REF is a single-base or multi-base sequence that comes from the human reference genome used for detecting the SNPs. The ALTS can consist of one or more alternative variations, each of which can represent a substitution, a deletion, or an insertion.

Dataset Construction

We construct a pre-training dataset consisting of sequences of SNP tokens by combining the sequence information from the human reference genome and the genomic variations from dbSNP. For each chromosome, we generate an L x 11 matrix, where L is the length of the corresponding chromosome and each of the 11 columns represents the probability of one nucleotide or indel token. We name this matrix a chromosome matrix. We fill the chromosome matrix using all SNPs labelled as COMMON in dbSNP, by filling the corresponding matrix positions with the REF and ALTS probabilities of the corresponding SNP record. Since the SNPs from dbSNP do not cover all genome positions, we fill all remaining positions with a probability of 1 for the nucleotide token at the corresponding position in the human reference genome.
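A minimal sketch of this construction is shown below, assuming SNP records have already been parsed into (position, token-probability) pairs; this record format and the helper names are illustrative assumptions, not dbSNP's actual file schema.

```python
import numpy as np

# 11 token types: 5 nucleotides + 6 indel tokens
TOKENS = ["A", "T", "C", "G", "N", "AI", "TI", "CI", "GI", "NI", "DEL"]
TOKEN_IDX = {t: i for i, t in enumerate(TOKENS)}

def build_chromosome_matrix(reference: str, snp_records):
    """Build an L x 11 chromosome matrix of per-position token probabilities.

    reference:   haploid reference sequence of the chromosome (e.g., "ACGT...").
    snp_records: iterable of (position, {token: probability}) pairs derived from
                 COMMON dbSNP entries (illustrative format, not dbSNP's own).
    """
    L = len(reference)
    matrix = np.zeros((L, len(TOKENS)), dtype=np.float32)

    # Default: probability 1 for the reference nucleotide at every position.
    for pos, base in enumerate(reference):
        matrix[pos, TOKEN_IDX.get(base, TOKEN_IDX["N"])] = 1.0

    # Overwrite positions covered by SNP records with REF/ALT probabilities.
    for pos, probs in snp_records:
        matrix[pos, :] = 0.0
        for token, p in probs.items():
            matrix[pos, TOKEN_IDX[token]] = p

    return matrix

# Toy example: a SNP at position 2 with REF 'G' (0.7) and ALT 'A' (0.3).
chrom = build_chromosome_matrix("ACGTT", [(2, {"G": 0.7, "A": 0.3})])
print(chrom[2])  # probability mass split between 'G' and 'A'
```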

For constructing the self-supervised pre-training dataset, we closely follow the typical NLP pre-training dataset construction pipeline. Specifically, we convert the chromosome matrix into a set of segments, where each segment comprises a number of SNP tokens. To construct the segments, we sample multiple sequences from different positions of a chromosome. For each position in a sampled sequence, we apply a sampling function to collect ‘X1’ and ‘X2’ (the nucleotide or indel tokens at the corresponding position) and construct the SNP token “X1/X2”. The dataset construction method can be applied to all the chromosome pairs except for the sex chromosomes, which do not always form a homologous pair. The details of our dataset construction approach are shown in Algorithm 1.

3.4 Self-Supervised SNP Pre-training

Inspired by BERT (devlin2019bert), SNP2Vec is trained with the masked language modeling (MLM) objective using a transformer-based model (vaswani2017transformer). The goal of MLM is to predict the representations of the masked tokens given their neighbouring sequence as the context. As complex genomic tasks, such as disease risk prediction, require understanding long genome sequences (>1,000 bps), we apply two methods to process long input sequences. First, we apply a transformer variant with a linear-attention mechanism, which reduces the computational complexity of self-attention from O(n^2) to O(n) in the sequence length n. Second, we apply BPE tokenization (sennrich2015neural) to encode the sequence of SNP tokens, compressing the sequence by aggregating neighbouring tokens. Unlike k-mer (min2017kmeremb; ji2021dnabert) and gapped k-mer (ghandi2014enhanced; shrikumar2019gkmexplain) tokenizations, BPE tokenization can merge dynamic-length tokens based on their co-occurrences efficiently without losing any information.
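As one possible realization of this compression step, the sketch below trains a BPE tokenizer over single-character SNP sequences using the HuggingFace tokenizers library; the library choice, special tokens, and toy corpus are our assumptions rather than a description of the exact tooling used in the paper.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

def train_snp_bpe(corpus_iterator, vocab_size=32000):
    """Train a byte-pair-encoding tokenizer over sequences of single-character
    SNP tokens (see Appendix A), merging frequent neighbouring tokens so that
    a >20,000 bp region fits into a few thousand subword tokens."""
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    )
    tokenizer.train_from_iterator(corpus_iterator, trainer=trainer)
    return tokenizer

# Toy usage: each "document" is a string of single-character SNP codes.
toy_corpus = ["ACGTACGTXXBACGT", "ACGTTTBDXACGTAC"] * 100
tok = train_snp_bpe(toy_corpus, vocab_size=300)
print(tok.encode("ACGTACGTX").tokens)
```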

Require: M: chromosome matrix
Require: f: SNP sampling function
Require: n: number of iterations
Require: t: start position threshold
Require: l_min: lower bound of segment length
Require: l_max: upper bound of segment length

1: Initialize D = {}
2: P = sample n positions from the range [0, t]
3: for all p in P do
4:     while p < length(M) do
5:         l = sample a segment length from [l_min, l_max]
6:         S = segment from p to p + l in M
7:         s = sample SNP tokens using f from each position in S
8:         D = D ∪ {s}
9:         p = p + l
10:     end while
11: end for
Algorithm 1 Self-Supervised Pre-training dataset construction for diploid SNP Encoding
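For concreteness, a Python rendering of Algorithm 1 under our reading of the pseudocode is sketched below; the symbol names, the weighted sampling helper, and the toy chromosome matrix are our assumptions, with the matrix laid out as an L x 11 array of per-position token probabilities as in the earlier sketch.

```python
import numpy as np

TOKENS = np.array(["A", "T", "C", "G", "N", "AI", "TI", "CI", "GI", "NI", "DEL"])

def sample_snp_token(row, rng):
    """Weighted sampling of the two alleles at one position, combined into an
    alphabetically ordered SNP token such as 'A/G' (the function f in Algorithm 1)."""
    p = row / row.sum()
    a1, a2 = rng.choice(TOKENS, size=2, p=p)
    return "/".join(sorted((a1, a2)))

def build_pretraining_dataset(matrix, rng, n_starts=4, start_threshold=1000,
                              min_len=512, max_len=4096):
    """Algorithm 1: sample SNP-token segments that tile the chromosome matrix
    starting from several randomly chosen offsets near the beginning."""
    dataset = []
    length = matrix.shape[0]
    starts = rng.integers(0, min(start_threshold, length), size=n_starts)
    for p in starts:
        p = int(p)
        while p < length:
            seg_len = int(rng.integers(min_len, max_len + 1))
            segment = matrix[p:p + seg_len]
            tokens = [sample_snp_token(row, rng) for row in segment]
            dataset.append(tokens)
            p += seg_len
    return dataset

rng = np.random.default_rng(0)
toy_matrix = rng.random((20000, 11)).astype(np.float32)  # stand-in for a real chromosome matrix
data = build_pretraining_dataset(toy_matrix, rng)
print(len(data), len(data[0]))
```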

4 Experiment Settings

4.1 Dataset

For building the pre-training data, we utilize GRCh37 as the human reference genome and dbSNP version 153 (https://ftp.ncbi.nih.gov/snp/archive/b153/00readme.txt) as the SNP database. We use weighted random sampling based on the probabilities of the SNPs at the corresponding position as the sampling function. For the downstream task, we construct a dataset of genome sequences for predicting late-onset Alzheimer’s disease (LOAD) (rabinovici2019load) in a Chinese cohort of 624 Hong Kong elderly subjects with a minimum age of 65. The subjects are diagnosed with Alzheimer’s by a medical professional through the Montreal Cognitive Assessment (MoCA) test (nasreddine2005moca), adjusted for their demographic information. Out of 624 subjects, 384 are labelled as Alzheimer’s disease carriers (ADs) and 240 are labelled as non-carriers (NCs). For the genome sequence, we collect sequencing data from the APOE region located in chromosome 19 from each subject, which is known to be highly correlated with Alzheimer’s disease in the Chinese cohort (zhou2019apoe; zhou2020adcn). We use the BWA-MEM (li2013aligning) aligner to align the sequencing data with the human reference genome.

4.2 Training and Evaluation Setting

For our experiments, we build a BPE tokenizer with a vocabulary size of 32,000 tokens. We pre-train a 6-layer linear-attention transformer model, Linformer (wang2020linformer), using a maximum sequence length of 4,096 tokens, a sequence projection length of 128 tokens, and a model dimension of 512. For simplicity, we refer to our pre-trained SNP2Vec model as Dipformer. The detailed hyperparameters of the BPE tokenizer and the Dipformer model are described in Appendix B. We run MLM pre-training for 200,000 steps with a 15% token replacement rate, where we replace the selected token with [MASK] 80% of the time, replace it with a random token 10% of the time, and keep it as is 10% of the time. More details about the pre-training hyperparameter settings are given in Appendix C.
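The corruption scheme is the standard BERT-style 80/10/10 strategy; a minimal PyTorch sketch is shown below, with hypothetical token IDs for [MASK] and the special tokens.

```python
import torch

def mask_for_mlm(input_ids, vocab_size, mask_token_id, special_ids=(), mlm_prob=0.15):
    """BERT-style MLM corruption: select 15% of tokens as prediction targets;
    of those, 80% become [MASK], 10% a random token, 10% are left unchanged."""
    labels = input_ids.clone()
    probs = torch.full(input_ids.shape, mlm_prob)
    for sid in special_ids:                       # never mask special tokens
        probs[input_ids == sid] = 0.0
    masked = torch.bernoulli(probs).bool()
    labels[~masked] = -100                        # ignored by cross-entropy loss

    # 80% of the selected positions -> [MASK]
    replace_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replace_mask] = mask_token_id

    # 10% of the selected positions -> a random token (half of the remaining 20%)
    random_mask = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replace_mask
    input_ids[random_mask] = torch.randint(vocab_size, input_ids.shape)[random_mask]
    # the final 10% keep their original token
    return input_ids, labels

ids = torch.randint(5, 32000, (2, 4096))          # toy batch of token IDs
corrupted, labels = mask_for_mlm(ids, vocab_size=32000, mask_token_id=4)
```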

For fine-tuning, we apply SNP encoding to the sequencing data, apply BPE tokenization, and add a [CLS] token as the prefix of the sequence to gather the sequence representation for predicting the risk of having Alzheimer’s disease. We fine-tune with three input sequence length settings: only the APOE gene with 3,611 bps (APOE only), APOE with an additional 5,000 bps upstream and downstream (APOE + 10k), and APOE with an additional 10,000 bps upstream and downstream (APOE + 20k). For each experiment, we apply 10-fold cross-validation to ensure the significance of the results. We evaluate the model performance using three evaluation metrics: accuracy, area under the ROC curve (AUROC), and area under the precision-recall curve (AUPRC). More details about the fine-tuning setup are described in Appendix D.
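The three metrics can be computed per fold with scikit-learn as sketched below; the fold loop and the placeholder predictions stand in for a fine-tuned model and are not part of the paper.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, average_precision_score
from sklearn.model_selection import StratifiedKFold

def evaluate_fold(labels, scores, threshold=0.5):
    """Per-fold accuracy, AUROC, and AUPRC from predicted risk scores."""
    preds = (np.asarray(scores) >= threshold).astype(int)
    return {
        "acc": accuracy_score(labels, preds),
        "auroc": roc_auc_score(labels, scores),
        "auprc": average_precision_score(labels, scores),
    }

# Toy 10-fold loop over the 624-subject cohort (384 AD, 240 NC); random scores
# stand in for the predictions of a fine-tuned model.
y = np.array([1] * 384 + [0] * 240)
X = np.zeros((len(y), 1))  # placeholder for tokenized genome sequences
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
rng = np.random.default_rng(0)
for train_idx, test_idx in skf.split(X, y):
    scores = rng.random(len(test_idx))  # placeholder predictions
    print(evaluate_fold(y[test_idx], scores))
```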

4.3 Baselines

To evaluate the effectiveness of the SNP encoding, we build two different deep learning models using the haploid token representation. First, we incorporate DeepSEA (zhou2015dsea), a CNN-based model developed for short-sequence chromatin profiling tasks (200-1,000 bps), and then we build another Linformer model pre-trained with the human reference genome using haploid tokens, called Hapformer. For the haploid token fine-tuning, we generate the haploid sequences from the aligned sequencing data: we generate the variant calling data with GATK HaplotypeCaller (mckenna2010gatk; depristo2011gatk) and apply phasing with Beagle (browning2007beagle; browning2009beagle). During fine-tuning, we feed each haploid sequence to the model and fuse the representations using a linear transformation. We also incorporate a logistic regression model from PLINK (purcell2007plink), which is a widely used approach for PRS.

Models Acc AUROC AUPRC
DeepSEA 0.591 0.579 0.703
PLINK PRS 0.592 0.607 0.705
Hapformer 0.572 0.615 0.715
Dipformer 0.643 0.673 0.734
Table 1: Results of our model and baselines. We refer to the pre-trained SNP2Vec model as Dipformer.
Tokenization 1-mer 3-mer 5-mer gkm (5,6) gkm (6,10) gkm (6,14) gkm (7,14) BPE
Avg. Token Length 500 498 496 83 50 36 36 81.19
BoW Linear 0.499 0.698 0.817 0.771 0.759 0.749 0.753 0.783*
CNN (DeepSEA) 0.890 0.903 0.898 0.808 0.764 0.749 0.751 0.811*
Transformer 0.727 0.788 0.825 0.785 0.771 0.761 0.762 0.789*
Average 0.706 0.796 0.847 0.788 0.765 0.753 0.755 0.795*
Table 2: Comparison of different tokenization methods in genome modeling (numbers denote the accuracy score), where gkm (k, g) denotes the gapped k-mer tokenization with the gap parameter g constituting the stride between each k-mer window. * denotes that BPE significantly outperforms the underlined baselines with a p-value < 0.01.

5 Results

The results of our model and baselines are shown in Table 1. We find that Dipformer outperforms existing strong baselines, such as DeepSEA and PLINK PRS, by a large margin. This confirms the effectiveness of our SNP2Vec pre-training and the ability of Dipformer to capture relevant features for AD prediction. Interestingly, Hapformer, which leverages large amounts of genomic sequences for pre-training, only performs comparably to DeepSEA and PLINK PRS. Moreover, by learning genomic variations in a diploid sequence during pre-training, Dipformer significantly outperforms Hapformer, with around a 5-7% improvement in terms of accuracy and AUROC. This shows that simply using an enormous amount of pre-training data does not necessarily improve AD prediction, and an effective genomics pre-training approach is essential to make full use of the unlabelled genomics data. More details on our results are shown in Appendix E.

6 Discussion

Figure 4: 10-fold AUROC performance of Dipformer with and without pre-training on Alzheimer’s disease risk prediction over different input sequence lengths.
Figure 5: 10-fold AUROC performance of pre-trained Hapformer and Dipformer on Alzheimer’s disease risk prediction over different input sequence lengths.

6.1 Effect of Different Tokenization Methods in Genomics

In this section, we study different tokenization methods for genome modeling and explore their effectiveness in capturing genomic patterns and features. We compare BPE tokenization with other common methods, such as k-mer and gapped k-mer (gkm) tokenization with various gap parameters. To achieve this, we conduct experiments on the chromatin profiling dataset from DeepSEA, which consists of 4,863,024 chromatin profiles (4,400,000 training, 8,000 validation, and 455,024 test) with 919 labels (690 transcription factor (TF) binding sites, 125 DNase marks, and 104 histone marks). Three different models are used in this experiment: a linear model with a bag-of-words (BoW) representation, a CNN-based model following the DeepSEA architecture, and a transformer model. The models need to predict the TF, DNase, and histone labels based on the input sequences under various tokenization methods. Hence, for the same model, a more effective tokenization method will lead to a higher prediction accuracy. Additionally, we use the average length of the tokenized sequences to measure the efficiency of each tokenization method, as it determines the input size for the model.

Table 2 reports the effectiveness and the average token length of different tokenization methods in genome modeling. We find that, with the linear BoW model, BPE significantly outperforms all other methods except 5-mer. With the CNN model, BPE remarkably surpasses all gapped k-mer methods except gkm (5,6). With the Transformer model, BPE performs similarly to 3-mer and gkm (5,6), and significantly outperforms 1-mer and the other gkm methods. Moreover, in terms of the score averaged across all three models, BPE performs comparably to 3-mer and remarkably outperforms 1-mer and all gkm methods.

Figure 6: Performance efficiency trade-off of using different tokenization approaches. The score is averaged over the three models (Linear, CNN, and Transformer). The size of the dots represents the vocabulary size of the tokenization method.

Figure 6 illustrates the trade-off between the performance and efficiency of different tokenization methods. We can see that, compared to k-mer methods, BPE performs comparably to 3-mer and slightly worse than 5-mer, but it is much more efficient due to its much shorter average token length. In addition, BPE remarkably outperforms gkm methods with comparable or slightly worse efficiency. Furthermore, from the size of the dots, we can see that BPE has a much larger vocabulary than the other methods, which indicates that BPE can potentially capture richer genomic patterns.

6.2 Effect of Pre-training for Disease Risk Prediction

In this section, we focus on exploring the effectiveness of pre-training for disease risk prediction. Figure 4 illustrates the 10-fold AUROC results of our Dipformer model with and without pre-training on Alzheimer’s disease risk prediction. The dashes in the figure represent the average AUROC over the 10 folds. As shown in Figure 4, the average AUROC scores of the pre-trained Dipformer significantly outperform those of the Dipformer without pre-training in all sequence length settings (APOE only, APOE + 10k, and APOE + 20k). Table 3 presents the quantitative results with additional metrics. The accuracy, AUROC, and AUPRC scores of the pre-trained Dipformer consistently outperform those of the non-pre-trained Dipformer in all sequence length settings. When increasing the sequence length, the non-pre-trained Dipformer improves only slightly, while the pre-trained Dipformer improves by a large margin. This shows the importance of pre-training for understanding long-sequence features.

6.3 Effect of the SNP Encoding in Genomics

To study the effect of the SNP encoding, we pre-train and fine-tune a model with the same genomics data but using haploid tokens, called Hapformer, as mentioned in Section 4. Figure 5 shows the 10-fold AUROC results of the pre-trained Hapformer and Dipformer on AD risk prediction over different input sequence lengths. Across all three sequence length settings, Dipformer achieves better average AUROC scores than Hapformer, with a p-value of 0.046 for the APOE + 20k setting, which indicates that the improvement from the SNP encoding is significant. Meanwhile, the results in Table 3 show that Dipformer also surpasses Hapformer on all other evaluation metrics. In addition, we observe that both Hapformer and Dipformer achieve better results when the input sequence is longer. This shows that employing long sequences is essential for handling complex genomics tasks such as disease risk prediction.

Model Accuracy AUROC AUPRC
Without Pre-training
Dipformer (APOE only) 0.567 0.532 0.667
Dipformer (APOE + 10k) 0.571 0.520 0.608
Dipformer (APOE + 20k) 0.588 0.551 0.668
With Pre-training
Hapformer (APOE only) 0.524 0.491 0.623
Hapformer (APOE + 10k) 0.565 0.591 0.705
Hapformer (APOE + 20k) 0.572 0.615 0.715
Dipformer (APOE only) 0.611 0.576 0.687
Dipformer (APOE + 10k) 0.574 0.612 0.710
Dipformer (APOE + 20k) 0.643 0.673 0.734

Table 3: Performance of Dipformer and Hapformer on Alzheimer’s disease risk prediction over different lengths of the input sequences.

7 Conclusion

In this paper, we introduce SNP2Vec, a self-supervised pre-training method for understanding genomic variations in a diploid sequence. Unlike prior methods in genomics, SNP2Vec represents each genomic position with a SNP token, which allows the model to capture genomic variations and makes it suitable for complex genomics prediction tasks such as predicting phenotypes. Using SNP2Vec, we pre-train a Linformer model called Dipformer and evaluate it on predicting late-onset Alzheimer’s disease risk in a Chinese cohort. Experimental results suggest that Dipformer significantly improves the prediction quality, by 5-7% accuracy and AUROC, over all other baselines, including the widely used polygenic risk score model from PLINK, the haploid variant of Dipformer, and a CNN-based genomics model called DeepSEA.

8 Future Work

In future work, we plan to focus on model explainability by using multiple analysis methods, such as analyzing the attention behaviour and the gradient saliency map, to gather and verify insights from the model. Evaluation on larger-scale datasets is also necessary to further demonstrate the effectiveness of SNP2Vec. Additionally, adapting SNP2Vec to other hereditary disorders and other complex genomics tasks is an essential direction for future work.

Acknowledgement

This work has been partially funded by the School of Engineering PhD Fellowship Award of the Hong Kong University of Science and Technology and the PF20-43679 Hong Kong PhD Fellowship Scheme, Research Grants Council, Hong Kong.

References

Appendix A Mapping of SNP Tokens

Mapping of SNP Tokens
A/A → A    C/C → C    G/G → G    T/T → T    N/N → N
AI/AI → B    CI/CI → D    GI/GI → H    NI/NI → O    TI/TI → U    DEL/DEL → X
Each of the remaining 55 heterozygous tokens (e.g., A/C, A/G, C/T, AI/C, DEL/A) is mapped to its own single-character symbol.
Table 4: Mapping of SNP tokens into a single character representation.

As our resulting SNP tokens are represented as ‘X1/X2’, to reduce the vocabulary size and facilitate a more straightforward representation for the downstream processes in the NLP pipeline, such as pre-processing, tokenization, and modeling, we map all SNP tokens into a single-character representation. The mapping of the SNP tokens into a single-character representation is shown in Table 4. We use non-alphabetical characters in addition to letters, as there are 66 SNP tokens in total, more than the 52 available alphabetical characters (lower and upper case from ‘A’ to ‘Z’). Also note that all the SNP tokens involving the unknown token ‘N’ except ‘N/N’ (such as ‘A/N’, ‘G/NI’, ‘N/NI’, ‘NI/NI’, etc.) are never used, since no actual SNP record corresponds to the unknown token ‘N’. The combinations involving the ‘N’ and ‘NI’ tokens are included only for completeness.

Appendix B Model Hyperparameters

We develop two Linformer (wang2020linformer) models, i.e., Dipformer and Hapformer, which are pre-trained using our proposed SNP tokens and the original nucleotide tokens, respectively. The two models have the same hyperparameter settings, resulting in an equal number of parameters. We list all the hyperparameters of our Dipformer and Hapformer models in Table 5.

Hyperparams Value
#layers 6
dim 512
k 128
dropout 0.1
num heads 8
dim head 64
num embeddings 32000
single KV head False
shared KV False
Table 5: Model Hyperparameters

Hyperparams Value
batch size 240
optimizer AdamW
learning rate 1e-4
scheduler 1 1
scheduler 2 0.999991
#steps 200,000
warmup steps 1000
loss fn Cross Entropy
random seed 0
Table 6: Pre-Training Hyperparameters

Hyperparams Value
batch size 16
optimizer AdamW
learning rate [1e-4..1e-6]
scheduler 1 1
scheduler 2 0.999991
#epochs 30
early stopping 3
loss fn Cross Entropy
random seed 0
Table 7: Fine-Tuning Hyperparameters

Appendix C Pre-Training Setup

During the pre-training phase, we build BPE tokenizers with a vocabulary size of 32,000 for both the SNP-token and nucleotide-token datasets. We pre-train both the Dipformer and Hapformer models for 200,000 steps using masked language modeling with the cross entropy loss. During pre-training, we apply a masking strategy similar to BERT (devlin2019bert) with a 15% token replacement rate, where we replace the selected token with [MASK] 80% of the time, replace it with a random token 10% of the time, and keep it as is 10% of the time. We run the pre-training on five NVIDIA RTX 2080 Ti GPUs and an Intel(R) Xeon(R) Silver 4210 CPU. We use the same hyperparameter settings for pre-training both the Dipformer and Hapformer models. The hyperparameters of our pre-training are shown in Table 6.

Appendix D Fine-Tuning Setup

We fine-tune all models on Alzheimer’s disease risk prediction on a Chinese cohort consisting of 624 subjects in total, 384 of which are labelled as Alzheimer’s disease carriers (ADs) while the other 240 are non-carriers (NCs). For predicting Alzheimer’s disease, we prepend a [CLS] token to the sequence. During fine-tuning, we take the output of the [CLS] token and apply a linear transformation to it to obtain the disease risk prediction. We evaluate the performance of all models using accuracy, area under the ROC curve (AUROC), and area under the precision-recall curve (AUPRC). We show all the hyperparameters of the fine-tuning phase in Table 7. We experiment with different learning rates for each model and find that the best setting is achieved when using a learning rate of 1e-4 for the models that are not pre-trained (the non-pre-trained Dipformer and DeepSEA) and a learning rate of 1e-5 for all pre-trained models (Dipformer and Hapformer).
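A minimal sketch of this classification head is given below, assuming a generic encoder that returns per-token hidden states; the module and the toy encoder are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class RiskClassificationHead(nn.Module):
    """Linear head on the [CLS] (first) token representation, producing a
    two-class logit for Alzheimer's disease risk (AD vs. NC)."""
    def __init__(self, encoder: nn.Module, hidden_dim: int = 512, num_classes: int = 2):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(input_ids)   # (batch, seq_len, hidden_dim)
        cls_repr = hidden[:, 0]            # representation of the prepended [CLS] token
        return self.classifier(cls_repr)   # (batch, num_classes) logits

# Toy usage with an embedding layer standing in for the pre-trained encoder.
toy_encoder = nn.Embedding(32000, 512)
model = RiskClassificationHead(toy_encoder)
logits = model(torch.randint(0, 32000, (4, 4096)))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1, 0, 1, 1]))
```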

Appendix E Detailed Results

In this section, we show the distribution of the 10-fold results from our experiments on the Alzheimer’s disease risk prediction task for all models (Dipformer, Hapformer, DeepSEA, and PLINK) on each evaluation metric. Figure 7 shows the distribution of the best 10-fold accuracy performance, Figure 8 shows the distribution of the best 10-fold AUROC performance, and Figure 9 shows the distribution of the best 10-fold AUPRC performance.

Figure 7: 10-fold accuracy performance of the best Dipformer, Hapformer, DeepSEA, and PLINK models on Alzheimer’s disease risk prediction.
Figure 8: 10-fold AUROC performance of the best Dipformer, Hapformer, DeepSEA, and PLINK models on Alzheimer’s disease risk prediction.
Figure 9: 10-fold AUPRC performance of the best Dipformer, Hapformer, DeepSEA, and PLINK models on Alzheimer’s disease risk prediction.