ER-index: a referential index for encrypted genomic databases

10/07/2019
by Ferdinando Montecuollo, et al.

Huge DBMSs storing genomic information are being created and engineered to perform large-scale, comprehensive and in-depth analyses of human beings and their diseases. However, recent regulations like the GDPR require that sensitive data be stored and processed using privacy-by-design methods and software. We designed and implemented ER-index, a new full-text index in minute space optimized for compressing and encrypting collections of genomic sequences, and for performing fast pattern-search queries on them. Our new index complements the E2FM-index, which was introduced to compress and encrypt collections of nucleotide sequences without relying on a reference sequence. When used on collections of highly similar sequences, the ER-index achieves compression ratios which are an order of magnitude smaller than those achieved with the E2FM-index, while maintaining its very good search performance. Moreover, thanks to the ER-index multi-user, multiple-key encryption model, a single index can store the sequences related to a population of individuals so that users may perform search operations only on the sequences to which they were granted access. The ER-index C++ source code, plus scripts and data to assess the tool performance, are available at: https://github.com/EncryptedIndexes/erindex.

1 Introduction

Predictive, preventive, precise and participatory medicine (P4 medicine, for short) is a new approach underpinned by genome sequencing that will soon be incorporated into our health systems. According to Sagner et al. (2017), the advantages of this approach for human health and wellbeing can be very significant: by reshaping healthcare from reactive to proactive, it represents the main answer to the progression of “silent” chronic diseases, which are the leading cause of death, disability and diminished quality of life in the developed world, and which strongly impact the economy of many countries.

However, these new approaches pose very big computational and security challenges. Data management in genomics is considered a “four-headed beast” due to the high computational costs arising over the lifecycle of a data set: acquisition, storage, distribution and analysis. The total amount of sequenced data doubles approximately every seven months, and Stephens et al. (2015) have projected the computational needs for 2025 at about one zetta-base of sequence acquisition per year and from 2 to 40 exabytes of data storage per year.

On the other hand, the storage of such a huge amount of data raises concerns about privacy and security. Human genome projects were initially open access, since it was believed that there was no risk of identification of participants or donors, but this approach was overturned after Homer et al. (2008) realized that individuals could be distinguished in Genome Wide Association Studies (GWAS) just by using summary statistics. In this respect, the most effective tools for protecting data without compromising their usability are given by modern cryptography, which offers algorithms and protocols for accessing and managing data in much more complex use case scenarios than the classical two-party and “plaintext-or-ciphertext” settings. Nonetheless, the choice of cryptographic algorithms and protocols and their implementation must be made carefully, otherwise the resulting system could be inefficient and/or not adequately protected against advanced attacks (e.g. chosen-ciphertext attacks and side-channel attacks, as defined in Menezes et al. (2010)) or against emerging computing platforms (e.g. quantum computers).

1.1 Related work

Montecuollo et al. (2017) introduced the E2FM-index, a full-text index in minute space optimized for compressing and encrypting nucleotide sequence collections, and for performing fast pattern-search queries on them, without the knowledge of a reference sequence. The E2FM-index is particularly suitable for metagenomics or de-novo discovery applications: it occupies as little as 1/20 of the storage required by the input FASTA file, saving about 95% of storage space, whereas the loss in pattern search performance due to encryption has no practical significance, being of the order of milliseconds in any case.

However, the E2FM-index is not suitable for compressing the genomic sequences of multiple individuals relative to a reference sequence. Moreover, its encryption model does not allow the use of multiple encryption keys for multiple sequences within the same index. Thus, a new index must be created for each new set of sequences whose access must be separately authorized. In turn, this can result in searching for patterns in several indexes, potentially slowing down search performance.

The E2FM-index first processes data through the Burrows-Wheeler transform (BWT) and the Move-to-front (MTF) transform, after which it compresses them with the RLE0 algorithm. The BWT approach does not seem as appropriate for referential compression as dictionary-based methods, so in the ER-index we adopted another compression strategy, based on the LZ77 algorithm.

Lempel-Ziv methods are lossless, dictionary-based compression algorithms which replace repetitions in a string with references to their previous occurrences. There are many variants, all derived from the two algorithms introduced by Ziv and Lempel (1977, 1978) and named LZ77 and LZ78, respectively.

Most of the self-indexes inspired by Lempel-Ziv parsing use LZ78, because the LZ78 factorization of a text has some interesting properties which make it possible to design efficient pattern search algorithms like that of Sirén et al. (2009). LZ78 is faster but more complex than LZ77, since it constructs a dictionary which tends to grow and fill up during compression. This happens systematically for big inputs like those in our application scenario, and the common methods to overcome this issue (see Salomon (2004)) do not allow one to gain the most advantage from the high similarity of genomic sequences.

The first self-index based on LZ77 was presented by Kreft and Navarro (2011): it offers good compression ratios and search performance, but its internal data structures were not designed to explicitly handle a collection of data items. This index also does not exploit the fundamental requisite of our application domain, that is, the compression of genomic sequences relative to a reference sequence.

The first attempt to compress a collection of individual genomes with respect to a reference sequence was made by Brandon et al. (2009). That work, like those of Kuruppu et al. (2010) and Kuruppu et al. (2011), aimed to build data structures suitable for efficiently compressing the collection while allowing fast random access to parts of it. Pattern search still remained an open question.

The problem of efficiently searching for patterns in such an index was addressed and solved later by Wandelt et al. (2013), but some of the data structures used therein do not allow the encryption of sequences related to different individuals with distinct keys.

1.2 Paper contribution and organization

In the present work we introduce the ER-index (Encrypted Referential index), the first encrypted self-index based on referential Lempel-Ziv compression and designed so that it can be the core of a multi-user database engine.

When used on collections of highly similar sequences, the ER-index achieves compression ratios which are an order of magnitude smaller than those of the E2FM-index, while maintaining its very good search performance. Moreover, the ER-index multi-user encryption model makes it possible to store the genomic sequences of different individuals with distinct encryption keys within the same index. This allows the index users to perform search operations only on the sequences to which they were granted access.

The paper is organized as follows. Section 2 gives an overview of the main features of the ER-index, along with the computational methods and data structures which make such features possible. Sections 3 and 4 give details on its core algorithms, pointing out some important differences of our approach with respect to current computing techniques for genomic databases. Section 5 illustrates how to construct a file-system based genomic database using the ER-index, and Section 6 reports and discusses the results of the tests we ran in order to assess the performance of our tool versus a state-of-the-art index, the wavelet-tree FM-index by Grossi et al. (2003). Finally, Section 7 sums up the main features of the ER-index and sketches out future work.

2 System and Methods

The ER-index is an open-source C++ tool designed to handle an encrypted genomic database. It is a full-text index consisting essentially of two major components:

  • a set of relative Lempel-Ziv factorizations, one for each sequence of the collection;

  • a set of auxiliary data structures to support encryption and search operations.

Both the factorizations and the auxiliary data structures are designed to permit efficient pattern searching, while allowing users to search only the sequences in the database to which they were granted access.

In order to apply encryption with a small overhead on search and compression performance, each factorization is split into a series of blocks, each containing the same fixed number of factors. Each block is then processed independently, so as to produce a compact representation whose size depends on the compressibility of the information addressed by its factors. Finally, the variable-size blocks thus obtained are encrypted independently of each other using the Salsa20 cipher of Bernstein (2005). Salsa20 was one of the ciphers selected for the eSTREAM portfolio of stream ciphers (see Babbage et al. (2008)), and it was designed for high performance in software implementations on Intel platforms. It produces a long keystream of bytes from a 256-bit key and an arbitrary 64-bit nonce, which is changed for each new run with the same key; a plaintext is then encrypted by XOR-ing it with the first bytes of the keystream, discarding the rest of the stream.
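The following is a minimal sketch of this per-block encryption step, with the block number used as nonce as described above. It relies on libsodium's crypto_stream_salsa20_xor; the library choice and all names in the snippet are illustrative assumptions, not the tool's actual code.

    #include <sodium.h>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Encrypt one variable-size block of the factorization under the
    // individual's 256-bit key, using the block number as the 64-bit nonce.
    std::vector<unsigned char> encrypt_block(
            const std::vector<unsigned char>& plain,
            const unsigned char key[crypto_stream_salsa20_KEYBYTES],
            std::uint64_t block_number) {
        unsigned char nonce[crypto_stream_salsa20_NONCEBYTES];  // 8 bytes = 64 bits
        std::memcpy(nonce, &block_number, sizeof nonce);        // block number as nonce
        std::vector<unsigned char> cipher(plain.size());
        // XOR the plaintext with the Salsa20 keystream (requires sodium_init()).
        crypto_stream_salsa20_xor(cipher.data(), plain.data(), plain.size(),
                                  nonce, key);
        return cipher;
    }

Since XOR-ing with the keystream is an involution, calling encrypt_block again on the ciphertext with the same key and block number recovers the plaintext, which is how a block is decrypted before being searched.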

A main point in protecting long-term, sensitive information – such as that stored in genomic databanks – is to provide encryption methods which can withstand advanced attacks and next-generation computing paradigms and platforms. As of 2019 there are no known attacks on Salsa20, and the 15-round reduced version of this cipher was proven 128-bit secure against differential cryptanalysis by Mouha and Preneel (2013). Moreover, according to Grote et al. (2019), it is resistant against side-channel attacks and the emerging quantum computing platforms.

2.1 Relative Lempel-Ziv factorization

Let S be a finite string of symbols over a finite alphabet Σ. Lempel-Ziv methods consist of rules for parsing S into a sequence of factors, so as to replace repetitions in S with references to their previous occurrences: factors indeed contain references to a dictionary of substrings. The difference between LZ77, LZ78 and the relative Lempel-Ziv factorization is that both LZ77 and LZ78 build their dictionary “on the fly”, putting in it substrings encountered in S before the current scanning position, whilst the relative Lempel-Ziv factorization obtains its compression by comparing the text to an already existing dictionary.

In the context of our application domain, S is a genomic sequence of an individual belonging to a given species for which a reference sequence R is known. As is well known in genomics, S is very similar to R, presenting only a small number of mutations, deletions and insertions, often in a percentage not greater than 1%. Thus using R rather than S to construct the dictionary can allow a better compression of S: indeed, a given portion of S is more similar to the corresponding portion of the reference sequence than to a previously seen substring of S. This is the basic idea of the so-called Referential Genome Compression, which can be implemented thanks to relative Lempel-Ziv factorization.

Definition 2.1

Relative Lempel-Ziv factorization. Let S and R be two finite strings over the same finite alphabet Σ. The relative Lempel-Ziv factorization of S with respect to the reference R, denoted RLZ(S, R), is a sequence of factors f_1 f_2 … f_n such that S = f_1 f_2 … f_n.

Each factor f_i is a triple (p_i, l_i, c_i), where:

  • p_i is the position in R of the longest substring w_i of R matching the current substring of S;

  • l_i is the length of w_i;

  • c_i (a.k.a. the mismatch character) is the last character of the factor, so that f_i = w_i c_i.
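To make the definition concrete, here is a minimal, unoptimized sketch of a relative Lempel-Ziv factorizer. It stores each factor's match position in R directly and finds matches by naive O(|R|·|S|) scanning, whereas the actual ER-index stores suffix array indexes and finds matches through an FM-index (Section 3); all names are illustrative.

    #include <string>
    #include <vector>

    struct Factor {
        std::size_t pos;   // start of the referential part in R
        std::size_t len;   // length of the referential part
        char mismatch;     // the factor's last character
    };

    std::vector<Factor> rlz_factorize(const std::string& S, const std::string& R) {
        std::vector<Factor> factors;
        std::size_t i = 0;
        while (i < S.size()) {
            std::size_t best_len = 0, best_pos = 0;
            for (std::size_t j = 0; j < R.size(); ++j) {      // try every start in R
                std::size_t l = 0;
                // Leave at least one character of S for the mismatch character.
                while (j + l < R.size() && i + l + 1 < S.size() && R[j + l] == S[i + l])
                    ++l;
                if (l > best_len) { best_len = l; best_pos = j; }
            }
            factors.push_back({best_pos, best_len, S[i + best_len]});
            i += best_len + 1;   // skip the matched part plus the mismatch character
        }
        return factors;
    }

Decompression is immediate: for each factor, copy len characters of R starting at pos, then append the mismatch character.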

2.2 B+ trees

B-trees and their B+ variant (Zhang (2004)) are dynamic balanced trees whose nodes contain data values and their related search keys. They are often used for database and file system indexing due to the fast search operations they allow. The main difference between B-trees and B+ trees is that the former allow every node to contain data values, while in the latter data values can be found only in leaf nodes, with every other node containing only search keys.

Definition 2.2

B+ tree. Let m be a positive integer. A B+ tree of order m is a tree with the following properties:

  • All leaf nodes are at the same level, i.e. the tree is balanced;

  • Every non-leaf node contains multiple search keys, stored in increasing order, which act as separation values for its subtrees: the left subtree of a key contains values lower than that key, and the right subtree contains values greater than it;

  • Every internal node contains at least ⌈m/2⌉ − 1 keys and at most m − 1 keys;

  • If the root node is not a leaf node, it has at least one key and at most m − 1 keys.

B+ trees support efficient updates and exact match queries, which find the values associated to a given key. They also make it possible to efficiently perform an operation known in the literature as a range query, which finds all the values related to keys in a given interval [k1, k2].

Typically, B-tree nodes are stored on secondary storage as fixed-size disk pages, whose size is a multiple of the hosting file system page size. In order to increase the number of keys and pointers stored in each page, a compression scheme can be applied which takes advantage of the fact that the keys in a node are very close to each other. The B+ trees implemented in the ER-index use the Invariable Coding method of Krizka et al. (2009), which for each node stores the first key and the differences between every other key and the first one, using the minimum number of bits required to express the difference between the last key and the first key.
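The following is a minimal sketch of this coding scheme, assuming unsigned 64-bit keys; the exact on-disk layout (here: full first key, a 6-bit width field, then fixed-width differences) is an illustrative choice, not the tool's actual format.

    #include <cstdint>
    #include <vector>

    // Append the `width` low bits of `value` to a little-endian bit buffer.
    static void put_bits(std::vector<std::uint8_t>& buf, std::size_t& bitpos,
                         std::uint64_t value, unsigned width) {
        for (unsigned b = 0; b < width; ++b, ++bitpos) {
            if (bitpos % 8 == 0) buf.push_back(0);
            if ((value >> b) & 1) buf[bitpos / 8] |= std::uint8_t(1u << (bitpos % 8));
        }
    }

    // Invariable Coding of a node's sorted keys: first key in full, then each
    // remaining key as its difference from the first, all packed with the
    // minimum bit width able to express (last key - first key).
    std::vector<std::uint8_t> encode_node_keys(const std::vector<std::uint64_t>& keys) {
        std::vector<std::uint8_t> buf;
        std::size_t bitpos = 0;
        std::uint64_t first = keys.front();
        std::uint64_t span = keys.back() - first;
        unsigned width = 1;
        while ((span >> width) != 0) ++width;     // minimum bits for the largest diff
        put_bits(buf, bitpos, first, 64);         // first key stored in full
        put_bits(buf, bitpos, width, 6);          // width field (6 bits cover 1..64)
        for (std::size_t i = 1; i < keys.size(); ++i)
            put_bits(buf, bitpos, keys[i] - first, width);
        return buf;
    }

For example, the keys {1000, 1003, 1009} give a span of 9, so each difference fits in 4 bits: two 4-bit differences are stored instead of two further 64-bit keys.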

2.3 The ER-index

Let C = {S_1, …, S_m} be a collection of sequences corresponding to m different individuals. Let R and R_rev be a reference sequence and its reverse. Let F_i be the relative LZ factorization of sequence S_i with respect to R. Let B_i,1, …, B_i,b_i be the sequence of blocks of factors obtained from F_i, where b_i denotes the resulting number of blocks and each block contains exactly the same fixed number of factors. Finally, let E(B, k, n) denote the Salsa20 encryption of a block B with a 256-bit secret key k and a 64-bit nonce n.

The ER-index stores each of the aforementioned blocks in encrypted form, using a different secret key k_i for each individual and the block number as nonce, so that the encryption of factorization F_i is given by:

E(F_i, k_i) = E(B_i,1, k_i, 1) E(B_i,2, k_i, 2) … E(B_i,b_i, k_i, b_i)
In order to speed up search operations, we have designed the Encrypted B+ tree (EB+ tree), an extended and encrypted variant of the B+ tree data structure. Thanks to an EB+ tree, each single factor of the encrypted collection is associated to the right individual identifier i and encryption key k_i.

Before performing encryption, we apply Invariable Coding to both node search keys and values, but in a different way than in Zhang (2004). Indeed, the authors of that work applied compression to arrange more values into fixed-size node pages, whereas we use it to obtain smaller variable-length nodes, thus minimizing the overall index size.

Definition 2.3

Encrypted Referential index. An Encrypted Referential index (ER-index) for a collection C = {S_1, …, S_m} of sequences with respect to a reference sequence R and a set of encryption keys K = {k_1, …, k_m} is a self-index consisting of:

  • the encrypted relative Lempel-Ziv factorizations of the sequences with respect to R: E(F_1, k_1), …, E(F_m, k_m);

  • a set of three EB+ trees whose search keys are respectively:

    1. the suffix array index corresponding to a suffix prefixing the reverse of the factor referential part;

    2. the suffix array index corresponding to a suffix prefixing the factor referential part;

    3. the position of the factor referential part in the reference sequence R;

    and whose values are the couples (i, f), where i identifies the sequence S_i and f is the corresponding Lempel-Ziv factor of that genomic sequence, encrypted with key k_i.
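The following C++ sketch mirrors this definition as a data layout; every type and field name here is hypothetical, chosen for illustration rather than taken from the repository.

    #include <cstdint>
    #include <map>
    #include <vector>

    struct EncryptedBlock {               // fixed number of factors, variable byte size
        std::uint64_t block_number;       // doubles as the Salsa20 nonce
        std::vector<std::uint8_t> cipher; // Salsa20-encrypted compact representation
    };

    struct EBTree;                        // encrypted B+ tree (Section 2.2)

    struct ERIndex {
        // One encrypted relative LZ factorization per individual, E(F_i, k_i).
        std::map<std::uint32_t, std::vector<EncryptedBlock>> factorizations;
        // Three EB+ trees keyed on: the suffix array index of the reversed
        // referential part, the suffix array index of the referential part,
        // and the position of the referential part in R. Values are
        // (individual id, factor) couples encrypted with that individual's key.
        EBTree* reverse_tree;
        EBTree* straight_tree;
        EBTree* position_tree;
    };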

3 Factorization algorithm

The factorization algorithm used to build the ER-index differs slightly from those proposed by Kuruppu et al. (2010) and Wandelt et al. (2013), as the ER-index uses a couple of FM-indexes to represent the reference sequence R and its reverse R_rev. A factor is again a triple of values, but in the ER-index its components have a different meaning:

  • the first component is the suffix array index from which to start the backward scan of R_rev in order to obtain the factor;

  • the second is the length of the factor, inclusive of the mismatch character;

  • the third is the mismatch character.

In order to speed up pattern search, the algorithm also retrieves the three auxiliary data stored as search keys in the corresponding EB+ trees composing the ER-index (see Definition 2.3). The algorithm, whose pseudo-code is given as Algorithm 1, uses four data structures related to the reference sequence R:

  • the FM-index of R;

  • the FM-index of the sequence R_rev obtained by reversing R;

  • a correspondence table, which maps each suffix of R_rev to the suffix of R starting from the same character;

  • the reverse correspondence table, which maps each suffix of R to the suffix of R_rev starting from the same character.

Given a sequence S, Algorithm 1 scans S from left to right and at each step tries to factorize the current suffix of S, searching for the maximum-length referential factor starting at the current position. For this purpose it scans the BWT of R_rev through backward search, starting from the current character and proceeding backward until a mismatch is found. This backward search gives as its result the suffix array range containing the suffixes prefixing the reverse of the factor's referential part; the algorithm chooses the first among them, as they are all equivalent for its purposes.

The further processing of the algorithm consists in retrieving the auxiliary information related to the previously found factor. The backward-step and extraction functions used for this purpose exactly match the canonical FM-index implementation, so they are not reported as pseudo-code.
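For completeness, here is the canonical FM-index backward search the algorithm builds on. The FMIndex interface is a placeholder for whatever FM-index implementation is in use; C(c) and rank(c, i) are the standard FM-index primitives (number of text characters smaller than c, and number of occurrences of c in the first i positions of the BWT).

    #include <cstdint>
    #include <string>
    #include <utility>

    struct FMIndex {
        std::uint64_t C(char c) const;                     // # of text chars < c
        std::uint64_t rank(char c, std::uint64_t i) const; // # of c in BWT[0, i)
        std::uint64_t size() const;                        // length of the indexed text
    };

    // Returns the suffix array range [sp, ep] of the suffixes prefixed by P;
    // the range is empty (sp > ep) when P does not occur in the text.
    std::pair<std::uint64_t, std::uint64_t>
    backward_search(const FMIndex& fm, const std::string& P) {
        std::uint64_t sp = 0, ep = fm.size() - 1;
        for (auto it = P.rbegin(); it != P.rend() && sp <= ep; ++it) {
            sp = fm.C(*it) + fm.rank(*it, sp);
            ep = fm.C(*it) + fm.rank(*it, ep + 1) - 1;
        }
        return {sp, ep};
    }

Algorithm 1 runs essentially this loop on the reverse reference index, except that it stops at the first character that empties the range instead of consuming a whole pattern.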

function Factorize(S)
    Scan S from left to right, keeping the current factor index and the current position in S. While the end of S has not been reached, retrieve the next factor as follows:
    1. Read the current character of S (not remapped in the FM-index) and set the current length of the next factor's referential part to zero.
    2. If the current character occurs in the reference sequence, remap it and start a backward search on the reverse reference index.
    3. Extend the backward search one character at a time. The search stops when the referential part includes the last-but-one character of S, OR the next character is not in the reference sequence, OR the last backward step was not successful, OR the next character is N and the last one is not (or vice versa).
    4. Take the first element of the resulting suffix array range as the factor's suffix array index on the reverse reference index; all the elements of the range are equivalent for the algorithm's purposes.
    5. Find the corresponding suffix array index on the straight reference index, doing backward steps on its FM-index with the help of the correspondence tables.
    6. Find the position of the factor's referential part in the reference sequence using the reference index marked rows.
    7. Do one more backward step on the FM-index to read the mismatch character.
    8. Store the retrieved factor, together with its auxiliary data, in the factors array and advance the current position in S past the factor.
end function
Algorithm 1 Factorization algorithm

4 Pattern search algorithm

The ER-index supports exact pattern matching through Algorithm 2. Before describing the algorithm's details, it is appropriate to make some considerations. A pattern search operation on a Lempel-Ziv factorization can retrieve two types of occurrences:

  • internal occurrences, which are completely contained in a factor's referential part;

  • external occurrences, also known in the literature as overlapping occurrences, which have at least one character outside a factor's referential part.

External occurrences can span two or more factors or end with a factor's mismatch character, and a solution to find them on LZ factorizations is proposed in Navarro (2002). The search pattern is split in all possible ways and, for each split point, the algorithm searches for the right side and the reversed left side in two related trie data structures (Brass (2008)). This results in two sets, the factors ending with the pattern's left side and the factors starting with the pattern's right side, and the algorithm eventually joins these two sets in order to obtain couples of consecutive factors. This approach can also be applied to relative Lempel-Ziv factorizations, but in our case it would require two tries for each individual, which is very expensive in terms of disk space. Thus the LocateExternalOccs function of Algorithm 2 follows a similar approach, but it makes use of the following less expensive data structures (a sketch of the resulting split-point loop follows the list):

  • the two FM-indexes of Algorithm 1, used to search for the maximal suffix of the pattern's left side and the maximal prefix of the pattern's right side occurring in the reference sequence;

  • a couple of EB+ trees used to retrieve the factors ending with that maximal left-side suffix and those starting with that maximal right-side prefix, respectively.
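The sketch below shows the shape of this split-point loop; the helper functions stand for the EB+ tree and FM-index machinery just described and for the PatRemPart verification of Algorithm 2, so their names and signatures are illustrative only.

    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    struct Factor { /* as stored in the EB+ trees */ };

    std::vector<Factor> factors_ending_with(const std::string& left_side);    // reverse tree
    std::vector<Factor> factors_starting_with(const std::string& right_side); // straight tree
    bool pattern_remaining_part(const Factor& f, const std::string& P, std::size_t split);

    std::vector<std::pair<Factor, std::size_t>>
    locate_external_occs(const std::string& P) {
        std::vector<std::pair<Factor, std::size_t>> occs;
        for (std::size_t split = 1; split < P.size(); ++split) {  // every way to cut P
            const std::string left = P.substr(0, split);
            const std::string right = P.substr(split);
            // Start from the side expected to yield fewer candidate factors,
            // as Algorithm 2 does; the other side is verified char by char.
            const auto candidates = (left.size() > right.size())
                                        ? factors_ending_with(left)
                                        : factors_starting_with(right);
            for (const Factor& f : candidates)
                if (pattern_remaining_part(f, P, split))
                    occs.emplace_back(f, split);
        }
        return occs;
    }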

As for internal occurrences, the approach in Navarro (2002) is based on the fact that each factor is the concatenation of a previous factor with an additional character, which is not true for relative Lempel-Ziv factorizations.

Therefore the LocateInternalOccs function of Algorithm 2 implements an original approach which uses once again the FM-index of the reference sequence R, together with a third EB+ tree whose search keys are the starting positions of the factors' referential parts in R. This last tree allows the retrieval of the factors whose referential part starts in a given position range of the reference sequence. The function also uses an auxiliary value lmax, defined as the maximum length over all factors contained in the ER-index, which is determined during the factorization process and stored in the index header.

Since an internal occurrence of the pattern is completely contained in the reference sequence, the first step of LocateInternalOccs can be to retrieve all the pattern's occurrences in the reference sequence. However, we then have to check that an individual sequence factor containing each reference occurrence really exists, and finally retrieve the occurrence positions in the individual sequences so found.

These issues have been addressed thanks to the following consideration. Suppose that the suffix array interval [sp, ep] is the result of a pattern search on the reference sequence; the position p in the reference sequence of each of its elements can be retrieved using the reference index marked rows. Given a factor, let l be the length of its referential part, s the starting position of that referential part in the reference sequence, and m the pattern length. A reference occurrence located at position p is also an individual sequence occurrence if and only if:

s ≤ p  and  s + l ≥ p + m    (1)

The first condition ensures that the factor's referential part does not start after the first character of the reference occurrence, while the second ensures that it does not end before the end of the reference occurrence. If both bounds were fixed values, the relevant factors could be retrieved by performing a range query on the position tree. The lower bound on s is actually not fixed, since it depends on the length l of each factor's referential part, but it can be relaxed using the maximum factor length lmax, querying the range [p + m − lmax, p]. Because of (1), the spurious factors returned by this relaxed range query can be filtered out by keeping only those satisfying s + l ≥ p + m.
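The sketch below captures this filter; the position tree accessor is a placeholder for the EB+ tree range query, and the field names are illustrative.

    #include <cstdint>
    #include <vector>

    struct Factor {
        std::uint64_t s;  // start of the referential part in R
        std::uint64_t l;  // length of the referential part
    };

    // Range query on the position tree: factors whose referential part
    // starts in [lo, hi].
    std::vector<Factor> pos_tree_range_query(std::uint64_t lo, std::uint64_t hi);

    // Candidate factors covering a reference occurrence at position p of a
    // pattern of length m, given the maximum factor length lmax.
    std::vector<Factor> internal_candidates(std::uint64_t p, std::uint64_t m,
                                            std::uint64_t lmax) {
        std::uint64_t lo = (p + m > lmax) ? p + m - lmax : 0;
        std::vector<Factor> hits;
        for (const Factor& f : pos_tree_range_query(lo, p))  // guarantees f.s <= p
            if (f.s + f.l >= p + m)  // referential part reaches the occurrence end
                hits.push_back(f);
        return hits;
    }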

function Locate(P)
    Collect the internal occurrences through LocateInternalOccs and the external occurrences through LocateExternalOccs; sort each individual's occurrences by position and remove any duplicates; find each occurrence's text position from its factor identifier and its factor offset; return the resulting occurrence list.
end function

function LocateExternalOccs(P)
    For each split point, divide P into a left side part (lsp) and a right side part (rsp). If the lsp is longer than the rsp, the factors expected to end with the lsp are fewer than those expected to start with the rsp: retrieve them through FindLeftSideFactors and scan through the individual identifiers; for each individual, get its factorization from an associative array containing all the individual factorizations and, for each of its candidate factors (selected through GetIndRetrFactors), exclude the case where the left side crosses the starting point of the current factor, since an occurrence of this type will be found for a preceding split point; verify the remaining part of the pattern through PatRemPart and, on success, record the occurrence. Otherwise the factors expected to start with the rsp are fewer than those expected to end with the lsp: retrieve them through FindRightSideFactors and proceed symmetrically. Return occs.
end function

function LocateInternalOccs(P)
    An internal occurrence certainly occurs in the reference sequence, so search P in the reference index first. For each reference occurrence, retrieve through a range query on the position tree the factors whose referential part may cover the occurrence, using lmax, the length of the maximum factor in the index; for each distinct individual, retrieve its factorization from an associative array containing all the individual factorizations and, for each candidate factor satisfying condition (1), record the corresponding occurrence. Return the occurrence list.
end function

function PatRemPart(occ)
    Check if the occurrence is really a whole pattern occurrence, updating it if required, and return true for successful checks. This function tries to extend the verified part of the pattern, both on the left and on the right side, by comparing the not yet verified pattern characters with the factor characters preceding and following the verified part. For performance, the extension is made without extracting the full text of the involved factors, but scanning the text one character at a time thanks to the reverse reference index.
end function

function FindLeftSideFactors(leftSide)
    Find through FindLeftSideLongestSuffix the longest left-side suffix occurring in the reference string and its suffix array range [sp, ep]; if the range is not empty, return the factors whose suffixArrayPosition is in [sp, ep] together with the left-side longest suffix length, otherwise return an empty factor list.
end function

function FindRightSideFactors(rightSide)
    Find through FindRightSideLongestPrefix the longest right-side prefix occurring in the reference string and its suffix array range [sp, ep]; if the range is not empty, return the factors whose suffixArrayPosition is in [sp, ep] together with the right-side longest prefix length, otherwise return an empty factor list.
end function

function FindRightSideLongestPrefix(rightSide)
    Scan the reverse index backward, starting from the first character of the right side and going on until a mismatch is found. Return the right-side longest prefix.
end function

function FindLeftSideLongestSuffix(leftSide)
    Scan the straight index backward, starting from the last character of the left side and going on until a mismatch is found. Return the left-side longest suffix.
end function

function searchPatInRefIndex(P)
    Perform a canonical backward search on the given FM-index, returning the [sp, ep] suffix array range corresponding to pattern P.
end function

function searchPatRevInRefIndex(P)
    Perform a backward search on the given FM-index, starting from the first pattern character, then the second one, and so on. Return the [sp, ep] suffix array range corresponding to the reverse of pattern P.
end function

function GetIndRetrFactors(indId, F)
    Return the indexes of the factors belonging to the individual indId, selecting them from the collection F, which contains factors belonging to several individuals.
end function
Algorithm 2 Pattern search algorithm

5 Implementation

The ER-index is designed to be the building block of an encrypted database storing genomic information about a possibly large set of individuals. Roughly speaking, an Encrypted Referential database (ER-database for short) is a collection of ER-indexes whose access is managed through portfolios of secret keys related to a population of individuals and a set of database users:

Definition 5.1

Encrypted Referential Database. Let R = {R_1, …, R_c} be a set of reference sequences for human chromosomes. Let P = {P_1, …, P_m} denote a set of individuals and S = {S_i,j} be a set of genomic sequences, where S_i,j is the sequence of individual P_i related to chromosome j. An Encrypted Referential Database (ER-database) for S with reference R is a tuple (K, U, I, G)

where:

  • K = {k_1, …, k_m} is a set of randomly generated symmetric encryption keys, such that k_i is uniquely and secretly associated to individual P_i for i = 1, …, m;

  • U is a set of database users, where each user u is allowed to access only the sequences of a subset of the individuals in P;

  • I is a set of ER-indexes for the population P, each one relative to a different reference sequence in R;

  • G is a mapping from U to the subsets of P that, for each user u, identifies the individuals in P whose sequences u is granted access to.

A simple implementation provides for an ER-database hosted by a file system directory, named the database root. The database root contains the database catalog catalog.xml, which lists all the individuals, users and reference sequences composing the database. Moreover, in the database root there are the subdirectories references and indexes, containing the sets R and I, respectively, and the subdirectory security, which contains the key portfolios of the database users in U. The key portfolio of a database user contains only the Salsa20 keys related to the individuals whose genomic information that user has been granted access to; it is handled with asymmetric encryption techniques and enciphered with the user's public key, so that only that user can read its content by using his/her private key.
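The sketch below shows the shape of portfolio creation under one loud assumption: it replaces the RSA enciphering described above with libsodium's sealed boxes (crypto_box_seal), purely to keep the example short; all names are illustrative.

    #include <sodium.h>
    #include <array>
    #include <vector>

    using Key256 = std::array<unsigned char, crypto_stream_salsa20_KEYBYTES>;

    // Concatenate the Salsa20 keys of the individuals granted to a user and
    // encrypt them under that user's public key: only the matching private
    // key can open the resulting portfolio.
    std::vector<unsigned char> build_portfolio(
            const std::vector<Key256>& granted_keys,
            const unsigned char user_pk[crypto_box_PUBLICKEYBYTES]) {
        std::vector<unsigned char> plain;
        for (const Key256& k : granted_keys)
            plain.insert(plain.end(), k.begin(), k.end());
        std::vector<unsigned char> sealed(plain.size() + crypto_box_SEALBYTES);
        crypto_box_seal(sealed.data(), plain.data(), plain.size(), user_pk);
        return sealed;
    }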

Figure 1: A simple scenario of key portfolios, where one user can access only the genomes of two individuals, whilst another user has access to the genomes of all the individuals in the database.

6 Experimental results

In order to evaluate the ER-index performance, a small ER-database concerning 50 individuals and 10 users was implemented as described in the previous section.

Using such a database, a comprehensive set of tests was performed on different computing platforms to measure the compression ratios (i.e., the ratio of the output data size to the input data size, as defined by Salomon (2004)) and the times required to build the index and search for patterns. The results were also compared with state-of-the-art software built thanks to the Sdsl C++ library by Gog and Petri (2014), available at http://github.com/simongog/sdsl. This library provides succinct data structures for implementing self-indexes like the Compressed Suffix Arrays (CSA) by Grossi and Vitter (2005) and the wavelet-tree FM-indexes by Grossi et al. (2003), and we extended this last kind of implementation so as to manage collections of items and report sequence-relative locations.
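For reference, the snippet below shows plain sdsl-lite usage of the wavelet-tree FM-index we compared against (building the index in memory, then counting and locating a pattern); our extension for collections is not shown, and the tiny input text is of course just a placeholder.

    #include <sdsl/suffix_arrays.hpp>
    #include <iostream>

    int main() {
        sdsl::csa_wt<> fm_index;                            // wavelet-tree FM-index
        sdsl::construct_im(fm_index, "ACGTACGTTACGTAACGT", 1); // in memory, byte alphabet
        std::cout << "'ACGT' occurs "
                  << sdsl::count(fm_index, "ACGT") << " times.\n";
        auto locations = sdsl::locate(fm_index, "ACGT");    // text positions
        for (auto pos : locations)
            std::cout << pos << '\n';
        return 0;
    }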

The tests were performed on the following three computing platforms, in order to assess the performance of our tool and its effectiveness with respect to the reference tool for different CPUs, memory sizes and operating environments:

  • ser, a small-size server with a 24-core Intel(R) Xeon(R) E5-2697 v2 processor at 2.7GHz and 180GB of DDR3 1333MHz memory, running the CentOS 7 operating system;

  • lap, a laptop with a 6-core AMD A10-9600P processor at 2.4GHz and 12GB of DDR4 1866MHz memory, running an Ubuntu on Windows application on the Windows 10 operating system with the Windows Subsystem for Linux feature turned on;

  • clu, a node of a computing cluster with two 10-core Intel Xeon E5-2670 processors at 2.6GHz and 196GB of DDR3 1600MHz memory, running the CentOS 6 operating system.

6.1 Experimental setup

The individual sequences chosen to assess the prototype performance are those related to human chromosomes 11 and 20, for a population of 50 individuals. Chromosomes 11 (135,086,622 base pairs) and 20 (64,444,167 base pairs) were chosen as representatives of big and small human chromosomes, respectively. The sequences were of two types:

  • Diploid consensus sequences obtained from the 1000 Genomes Project (www.internationalgenome.org/home). These sequences were built starting from the respective BAM files and using the samtools mpileup (www.htslib.org) command along with the bcftools and vcfutils utilities.

  • Pseudo-random sequences obtained by applying single mutations, insertions and deletions to the corresponding chromosome reference sequence in the human genome bank HS37D5, a variant of the GRCh37 human genome assembly used by the 1000 Genomes Project. For this purpose Montecuollo et al. (2017) built a tool which selects, with uniform distribution, mutations, insertions and deletions according to the percentages observed on average by Mullaney et al. (2010) among different individuals of the human species.

Although artificially generated, the second kind of sequences is more appropriate than consensus sequences to evaluate real performance, since they are free from the spurious symbols caused by sequencing machine errors or inaccuracies.

For each of the two above types we considered full-length sequences and 1MB sequences, the latter obtained by selecting one million bases of those chromosomes. Thus we performed our tests on a total of eight kinds of genomic sequence collections, with consensus collections denoted as 20_1MB, 20_FULL, 11_1MB and 11_FULL, and their artificially generated counterparts identified by the suffix _R. Some tests also include 5MB sequences, obtained by selecting five million bases from chromosomes 11 and 20.

The encryption set-up consisted in the generation of fifty 256-bit symmetric keys through the openssl rand command, plus ten RSA key pairs using the openssl genrsa and openssl rsa commands. The key portfolio of each database user was then generated by choosing a subset of the pool of symmetric keys and enciphering it with the user's public key.

6.2 Construction times

Tables 1 and 2 show the times required to construct the ER-index on the three considered computing platforms; moreover, the first table reports a comparison with the reference tool on ser. Similar results were obtained on the other two platforms, showing that the times required to build the ER-index – except for some very short (1MB) sequences – are significantly lower than those for the Sdsl wavelet-tree FM-index, despite the fact that only the ER-index implements data encryption.

20_1MB 20_5MB 20_FULL 11_1MB 11_5MB 11_FULL
ER 11.38 33.74 455.7 23.18 42.79 1005
Sdsl 20.08 132.1 2061 19.33 154.9 5693
Table 1: Times (sec) required to build the ER-index (ER) and the wavelet-tree FM-index (Sdsl) on the ser platform.

This noticeable performance has been obtained through our parallel factorization algorithm, which exploits the multi-core, hyper-threaded architecture of modern processors (see Section 3). As can be inferred by comparing the obtained values, the speed-up increases with the number of cores, so it could be even greater on higher-end machines with more processor cores.
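A minimal sketch of one natural way to parallelize the construction is shown below, factorizing each individual's sequence in a separate task; whether the tool parallelizes across individuals or within single sequences is not detailed here, so this decomposition is an assumption (rlz_factorize is the sketch from Section 2.1).

    #include <future>
    #include <string>
    #include <vector>

    struct Factor { std::size_t pos, len; char mismatch; };
    std::vector<Factor> rlz_factorize(const std::string& S, const std::string& R);

    // Factorize every individual sequence against the same reference R,
    // one asynchronous task per sequence.
    std::vector<std::vector<Factor>>
    factorize_all(const std::vector<std::string>& sequences, const std::string& R) {
        std::vector<std::future<std::vector<Factor>>> jobs;
        for (const std::string& s : sequences)
            jobs.push_back(std::async(std::launch::async,
                                      rlz_factorize, std::cref(s), std::cref(R)));
        std::vector<std::vector<Factor>> result;
        for (auto& j : jobs)
            result.push_back(j.get());   // collect factorizations in input order
        return result;
    }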

20_1MB 20_1MB_R 20_FULL_R 11_1MB 11_1MB_R 11_FULL_R
lap 50.09 47.64 2087.75 63.64 69.32 5409.86
clu 9.91 9.81 256.25 19.78 16.24 528.54
Table 2: Times (sec) required to build the ER-index on the lap and clu platforms.

Note that the full collections have notable sizes (about 2.97 and 6.4 GiB for the chromosome 20 and chromosome 11 sets, respectively), and this resulted in long computing times (about 0.58 and 1.5 hours) on the lap platform. However, we believe these last results are not very indicative, since they are probably due to improper memory management by the virtual machine.

6.3 Compression ratios

Figure 2 reports the compression ratios of the ER-index versus the wavelet-tree FM-index on the ser computing platform for the collections obtained from the 1000 Genomes Project. As expected, similar results were obtained on the lap and clu platforms, showing that the ER-index achieves compression ratios about four times smaller than the reference tool.

Figure 2: ER-index (ER) versus wavelet-tree FM-index (Sdsl) compression ratios for the chromosome 20 (a) and chromosome 11 (b) collections on the ser platform.

The compression ratio values on pseudo-random sequences were about half of those for the previous sequences: since sequences created by an algorithm lack spurious symbols, it is possible to find longer matches between the analyzed sequence and the reference one, and so achieve a better compression performance. Overall (see Table 3) this is very good for the ER-index, resulting in at least 97% savings in space. For example, the 6.4 GiB of the chromosome 11 full collection resulted in an index smaller than 192 MiB.

20_1MB_R 20_FULL_R 11_1MB_R 11_FULL_R
0.026 0.028 0.026 0.030
Table 3: ER-index compression ratios on the collections of pseudo-random sequences.

It is worth noting here that the reported figures are mean values obtained by building the index several times (we usually performed 18 index builds for each collection, in order to filter out spurious computing time values due to unpredictable overheads from other processes running on the same platform). As a matter of fact, the multithreading approach causes the operations to be executed in a different order during each factorization, and thus a different order in the creation of the auxiliary data structures. Since such data structures are B+ trees, different key insertion orders may produce different splits in their nodes, resulting in small changes in the size of the index.

6.4 Pattern search performance

For each collection described in Section 6.1, the tests to evaluate pattern search (a.k.a. locate operation) performance were run as follows:

  • the index (ER-index or wavelet-tree FM-index) related to the given collection was selected, and only its header was loaded in memory;

  • for each pattern length m in {20, 50, 100, 200, 500}, 500 patterns were randomly extracted from the sequences composing the collection, and all of them were searched through the index;

  • mean and median values of the 500 search times, and of the search times per occurrence, obtained at the previous step were computed;

  • the index was closed, and the next test was performed.

Figure 3 plots the search time values obtained on the ser platform for two of the collections, whereas Table 4 sums up the results obtained for the other collections on the lap and clu platforms.

For full collections, pattern search times should be proportional to the number of found occurrences, and thus they should decrease with pattern length, since a bigger pattern length m results in fewer chances to find the pattern. However, the obtained results clearly show that for such collections search times are higher for m = 20 than for m = 50, but that they increase afterward. This is due to the algorithm used for external occurrences (see Section 4), since the number of split points to check increases with the pattern length. However, this behaviour could be noticeably improved by parallelizing the operations for the several split points, which are naturally independent from each other.

Another interesting observation which follows from the obtained results is that median values are significantly smaller than mean values. This is because of a 10-15% share of outliers, due to some patterns that are hard to search, or to the fact that some searches were performed right after the opening of an index, when only a small number of factorizations and EB+ tree blocks had been loaded in memory. Patterns may be hard to search because they have a much greater number of occurrences than the other patterns of the same length, or because they span many short factors, so that the left or right sides related to some split points are very short strings.

Figure 3: ER-index mean (a) and median (b) pattern search times (ms) for two collections on the ser platform.

Overall these results show that locate operations are executed very fast thanks to the ER-index: even on a small computing platform running a virtual machine like lap, they take less than half a second in the worst case (i.e., looking for 500-base patterns in a collection of 6.4 GiB). A comparison with the wavelet-tree FM-index has shown that the latter performs better on longer patterns, but slightly worse on short patterns. These differences are however of the order of hundredths of a second, so they are irrelevant from a practical point of view, except in application scenarios where a massive amount of pattern searches is required.

Since pattern search times are proportional to the number of found occurrences, it is appropriate to look at the mean and median values of the search time per occurrence, reported in Figure 4 for the ser platform and two of the collections. These results show that mean search times per occurrence grow with pattern length, starting from a few milliseconds for the shortest patterns up to a maximum of 190.622 ms for the longest ones on the bigger collection. Note that the curves of the median values related to the two collections perfectly overlap. This attests to the scalability of the ER-index: collections of increasing size can be managed without a significant loss in performance.

20_1MB 20_1MB_R 20_FULL_R 11_1MB 11_1MB_R 11_FULL_R
lap:
20 2.11 1.41 45.19 4.96 0.99 122.41
50 2.71 1.89 7.89 3.07 1.73 45.72
100 7.63 6.51 12.38 8.00 5.77 19.39
200 47.93 34.93 52.20 37.60 33.05 61.95
500 363.47 359.59 382.79 371.36 367.98 408.45
clu:
20 0.40 0.29 4.19 0.58 0.31 8.84
50 0.92 0.80 1.61 1.00 0.81 1.72
100 3.04 2.78 4.25 3.18 2.87 4.70
200 13.17 12.39 14.88 13.62 13.04 16.33
500 86.97 83.25 85.20 87.72 86.45 91.89
Table 4: ER-index mean search times (ms) for each pattern length on the lap (top) and clu (bottom) platforms.
Figure 4: ER-index mean (a) and median (b) search times (ms) per occurrence for two collections on the ser platform.

7 Conclusion and future work

The ER-index is a new tool designed to be the core of secure genomic databases: it exploits inter-sequence redundancy to achieve very low compression ratios, and it stores the sequences of different individuals within the same index so that they are encrypted on disk with different encryption keys. This new index can store collections of genomic sequences in less than 1/30 of their original space, outperforming state-of-the-art tools like the wavelet-tree FM-index and offering in addition the critical feature of built-in, quantum-resistant encryption. Moreover, the Sdsl library index has to be loaded entirely in RAM to perform any search operation, whereas the ER-index loads in RAM only the blocks required to perform the requested operation.

The data structures provided with the ER-index allow for very good pattern search performance: our tests have shown that search times per occurrence are less than two tenths of a second on an encrypted and compressed collection of 6.4 GiB. As a matter of fact, the ER-index outperforms the wavelet-tree FM-index in searching for short patterns, whilst it is slower, in the order of hundredths of a second, for longer patterns. This is a remarkable result, considering that the wavelet-tree FM-index does not operate on encrypted sequences. Moreover, times for searching long patterns could be noticeably reduced by applying multi-threading techniques to the pattern search algorithm.

A multi-threaded search strategy and an algorithm for inexact search operations are under investigation and will be implemented in a future release of the ER-index.

References

  • S. Babbage, C. De Cannière, A. Canteaut, C. Cid, H. Gilbert, T. Johansson, M. Parker, B. Preneel, V. Rijmen, and M. Robshaw (2008) The eSTREAM portfolio (rev. 1). Available at: http://www.ecrypt.eu.org/stream/finallist.html
  • D. J. Bernstein (2005) Salsa20 specification. Available at: http://www.ecrypt.eu.org/stream/salsa20pf.html
  • M. C. Brandon, D. C. Wallace, and P. Baldi (2009) Data structures and compression algorithms for genomic sequence data. Bioinformatics 25 (14), pp. 1731–1738.
  • P. Brass (2008) Advanced Data Structures. Cambridge University Press.
  • S. Gog and M. Petri (2014) Optimized succinct data structures for massive data. Software: Practice and Experience 44 (11), pp. 1287–1314.
  • R. Grossi, A. Gupta, and J. S. Vitter (2003) High-order entropy-compressed text indexes. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 841–850.
  • R. Grossi and J. S. Vitter (2005) Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing 35 (2), pp. 378–407.
  • O. Grote, A. Ahrens, and C. Benavente-Peces (2019) A review of post-quantum cryptography and crypto-agility strategies. In International Interdisciplinary PhD Workshop.
  • N. Homer, S. Szelinger, M. Redman, D. Duggan, W. Tembe, J. Muehling, J. V. Pearson, D. A. Stephan, S. F. Nelson, and D. W. Craig (2008) Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics 4 (8), e1000167.
  • S. Kreft and G. Navarro (2011) Self-indexing based on LZ77. In Proceedings of the 22nd Annual Conference on Combinatorial Pattern Matching, CPM'11, Berlin, Heidelberg, pp. 41–54.
  • F. Krizka, M. Krátký, and R. Bača (2009) Benchmarking a B-tree compression method. In Proceedings of the Conference on Theory and Practice of Information Technologies.
  • S. Kuruppu, S. J. Puglisi, and J. Zobel (2010) Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In Proceedings of the 17th International Symposium on String Processing and Information Retrieval, SPIRE'10, Berlin, Heidelberg, pp. 201–206.
  • S. Kuruppu, S. J. Puglisi, and J. Zobel (2011) Optimized relative Lempel-Ziv compression of genomes. In Proceedings of the Thirty-Fourth Australasian Computer Science Conference, ACSC '11, Darlinghurst, Australia, pp. 91–98.
  • A. J. Menezes, P. C. Van Oorschot, and S. A. Vanstone (2010) Handbook of Applied Cryptography. CRC Press.
  • F. Montecuollo, G. Schmid, and R. Tagliaferri (2017) E2FM: an encrypted and compressed full-text index for collections of genomic sequences. Bioinformatics 33 (18), pp. 2808–2817.
  • N. Mouha and B. Preneel (2013) Towards finding optimal differential characteristics for ARX: application to Salsa20. Cryptology ePrint Archive, Report 2013/328.
  • J. M. Mullaney, R. E. Mills, W. S. Pittard, and S. E. Devine (2010) Small insertions and deletions (INDELs) in human genomes. Human Molecular Genetics 19 (R2), pp. R131–R136.
  • G. Navarro (2002) Indexing text using the Ziv-Lempel trie. In Proceedings of the 9th International Symposium on String Processing and Information Retrieval, SPIRE 2002, London, UK, pp. 325–336.
  • M. Sagner, A. McNeil, P. Puska, C. Auffray, N. D. Price, L. Hood, C. J. Lavie, Z. Han, Z. Chen, S. K. Brahmachari, et al. (2017) The P4 health spectrum – a predictive, preventive, personalized and participatory continuum for promoting healthspan. Progress in Cardiovascular Diseases 59 (5), pp. 506–521.
  • D. Salomon (2004) Data Compression: The Complete Reference. Springer Science & Business Media.
  • J. Sirén, N. Välimäki, V. Mäkinen, and G. Navarro (2009) Run-length compressed indexes are superior for highly repetitive sequence collections. In Proceedings of the 15th International Symposium on String Processing and Information Retrieval, SPIRE '08, Berlin, Heidelberg, pp. 164–175.
  • Z. D. Stephens, S. Y. Lee, F. Faghri, R. H. Campbell, C. Zhai, M. J. Efron, R. Iyer, M. C. Schatz, S. Sinha, and G. E. Robinson (2015) Big data: astronomical or genomical? PLoS Biology 13 (7), e1002195.
  • S. Wandelt, J. Starlinger, M. Bux, and U. Leser (2013) RCSI: scalable similarity search in thousand(s) of genomes. Proc. VLDB Endow. 6 (13), pp. 1534–1545.
  • D. Zhang (2004) In Handbook of Data Structures and Applications (Chapman & Hall/CRC Computer and Information Science Series), D. P. Mehta and S. Sahni (Eds.).
  • J. Ziv and A. Lempel (1977) A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23 (3), pp. 337–343.
  • J. Ziv and A. Lempel (1978) Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory 24 (5), pp. 530–536.