String Sanitization: A Combinatorial Approach

06/26/2019 · by Giulia Bernardini, et al. · University of Pisa · Centrum Wiskunde & Informatica

String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge (e.g., trips to mental health clinics from a string representing a user's location history). In this paper, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility. First, we propose a time-optimal algorithm, TFS-ALGO, to construct the shortest string preserving the order of appearance and the frequency of all non-sensitive patterns. Such a string allows accurately performing tasks based on the sequential nature and pattern frequencies of the string. Second, we propose a time-optimal algorithm, PFS-ALGO, which preserves a partial order of appearance of non-sensitive patterns but produces a much shorter string that can be analyzed more efficiently. The strings produced by either of these algorithms may reveal the location of sensitive patterns. In response, we propose a heuristic, MCSR-ALGO, which replaces letters in these strings with carefully selected letters, so that sensitive patterns are not reinstated and occurrences of spurious patterns are prevented. We implemented our sanitization approach that applies TFS-ALGO, PFS-ALGO and then MCSR-ALGO and experimentally show that it is effective and efficient.


1 Introduction

A large number of applications, in domains ranging from transportation to web analytics and bioinformatics, feature data modeled as strings, i.e., sequences of letters over some finite alphabet. For instance, a string may represent the history of visited locations of one or more individuals, with each letter corresponding to a location. Similarly, it may represent the history of search query terms of one or more web users, with letters corresponding to query terms, or a medically important part of the DNA sequence of a patient, with letters corresponding to DNA bases. Analyzing such strings is key in applications including location-based service provision, product recommendation, and DNA sequence analysis. Therefore, such strings are often disseminated beyond the party that has collected them. For example, location-based service providers often outsource their data to data analytics companies who perform tasks such as similarity evaluation between strings [15], and retailers outsource their data to marketing agencies who perform tasks such as mining frequent patterns from the strings [16].

However, disseminating a string intact may result in the exposure of confidential knowledge, such as trips to mental health clinics in transportation data [23], query terms revealing political beliefs or sexual orientation of individuals in web data [19], or diseases associated with certain parts of DNA data [17]. Thus, it may be necessary to sanitize a string prior to its dissemination, so that confidential knowledge is not exposed. At the same time, it is important to preserve the utility of the sanitized string, so that data protection does not outweigh the benefits of disseminating the string to the party that analyzes it, or to society at large. For example, a retailer should still be able to obtain actionable knowledge in the form of frequent patterns from the marketing agency that analyzed their outsourced data; and researchers should still be able to perform analyses such as identifying significant patterns in DNA sequences.

Our Model and Setting. Motivated by the discussion above, we introduce the following model, which we call Combinatorial String Dissemination (CSD). In CSD, a party has a string W that it seeks to disseminate, while satisfying a set of constraints and a set of desirable properties. For instance, the constraints aim to capture privacy requirements and the properties aim to capture data utility considerations (e.g., posed by some other party based on applications). To satisfy both, W must be transformed to a string X by applying a sequence of edit operations. The computational task is to determine this sequence of edit operations so that X satisfies the desirable properties subject to the constraints.

Under the CSD model, we consider a specific setting in which the sanitized string X must satisfy the following constraint C1: for an integer k > 0, no given length-k substring (also called pattern) modeling confidential knowledge should occur in X. We call each such length-k substring a sensitive pattern. We aim at finding the shortest possible string X satisfying the following desired properties: (P1) the order of appearance of all other length-k substrings (non-sensitive patterns) is the same in W and in X; and (P2) the frequency of these length-k substrings is the same in W and in X. The problem of constructing X in this setting is referred to as TFS (Total order, Frequency, Sanitization). Clearly, substrings of arbitrary lengths can be hidden from X by setting k equal to the length of the shortest substring we wish to hide, and then setting, for each longer substring we wish to hide, one of its length-k substrings as sensitive.

Our setting is motivated by real-world applications involving string dissemination. In these applications, a data custodian disseminates the sanitized version X of a string W to a data recipient, for the purpose of analysis (e.g., mining). W contains confidential information that the data custodian needs to hide, so that it does not occur in X. Such information is specified by the data custodian based on domain expertise, as in [1, 4, 12, 16]. At the same time, the data recipient specifies the properties P1 and P2 that X must satisfy in order to be useful. These properties map directly to common data utility considerations in string analysis. By satisfying P1, X allows tasks based on the sequential nature of the string, such as blockwise q-gram distance computation [13], to be performed accurately. By satisfying P2, X allows computing the frequency of length-k substrings [21] and hence mining frequent length-k substrings with no utility loss. We require that X has minimal length so that it does not contain redundant information. For instance, the string constructed by concatenating all non-sensitive length-k substrings of W and separating them with a special letter that does not occur in W satisfies P1 and P2 but is not the shortest possible. Such a string would have a negative impact on the efficiency of any subsequent analysis tasks performed on it.

Note that existing works for sequential data sanitization (e.g., [4, 12, 14, 16, 25]) or anonymization (e.g., [3, 5, 7]) cannot be applied to our setting (see Section 7).

Our Contributions. We define the TFS problem for string sanitization and a variant of it, referred to as PFS (Partial order, Frequency, Sanitization), which aims at producing an even shorter string Y by relaxing P1 of TFS. Our algorithms for TFS and PFS construct strings X and Y using a separator letter #, which is not contained in the alphabet Σ of W. This prevents occurrences of sensitive patterns in X or Y. The algorithms repeat proper substrings of sensitive patterns so that the frequency of non-sensitive patterns overlapping with sensitive ones does not change. For X, we give a deterministic construction which may be easily reversible (i.e., it may enable a data recipient to construct W from X), because the occurrences of # reveal the exact location of sensitive patterns. For Y, we give a construction which breaks several ties arbitrarily, thus being less easily reversible. We further address the reversibility issue by defining the MCSR (Minimum-Cost Separators Replacement) problem and designing an algorithm for dealing with it. In MCSR, we seek to replace all separators, so that the location of sensitive patterns is not revealed, while preserving data utility. We make the following specific contributions:

1. We design an algorithm, TFS-ALGO, for solving the TFS problem in O(kn) time, where n is the length of W. In fact, we prove that O(kn) time is worst-case optimal by showing that the length of X is Θ(kn) in the worst case. The output of the algorithm is a string X consisting of a sequence of substrings over the alphabet of W separated by # (see Example 1 below). An important feature of our algorithm, which is useful in the efficient construction of Y discussed next, is that it can be implemented to produce an O(n)-sized representation of X with respect to W in O(n) time. See Section 3.

Example 1

Let W = aabaaaababbbaab, k = 4, and the set of sensitive patterns be {baaa, aaaa, bbaa}. The string X = aabaa#aaababbba#baab consists of three substrings over the alphabet {a, b} separated by #. Note that no sensitive pattern occurs in X, while all non-sensitive substrings of length 4 have the same frequency in W and in X (e.g., aaba appears twice) and appear in the same order in W and in X (e.g., babb precedes abbb). Also, note that any string shorter than X would either create sensitive patterns or change the frequencies (e.g., removing the last letter of X creates a string in which baab no longer appears).∎
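Properties C1, P1 and P2 can be checked mechanically. The following minimal Python sketch (ours, with hypothetical helper names) verifies them for the strings of Example 1; length-k windows containing # are skipped, since # is not in the alphabet of W:

from collections import Counter

def block_substrings(s, k, sep="#"):
    """Length-k substrings of s that do not span the separator."""
    return [s[i:i + k] for i in range(len(s) - k + 1)
            if sep not in s[i:i + k]]

W, X, k = "aabaaaababbbaab", "aabaa#aaababbba#baab", 4
sensitive = {"baaa", "aaaa", "bbaa"}

non_sens_W = [u for u in block_substrings(W, k) if u not in sensitive]
subs_X = block_substrings(X, k)

assert not sensitive & set(subs_X)             # C1: no sensitive pattern in X
assert non_sens_W == subs_X                    # P1: same order of appearance
assert Counter(non_sens_W) == Counter(subs_X)  # P2: same frequencies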

2. We define the PFS problem, relaxing P1 of TFS to produce shorter strings that are more efficient to analyze. Instead of a total order (P1), we require a partial order (Π1) that preserves the order of appearance only for sequences of successive non-sensitive length-k substrings that overlap by k−1 letters. This makes sense because the order of two successive non-sensitive length-k substrings with no length-(k−1) overlap has anyway been “interrupted” (by a sensitive pattern). We exploit this observation to shorten the string further. Specifically, we design an algorithm, PFS-ALGO, that solves PFS in the optimal O(n + |Y|) time, where |Y| is the length of Y, using the O(n)-sized representation of X. See Section 4.

Example 2

(Cont’d from Example 1) Recall that X = aabaa#aaababbba#baab. A string Y is aaababbba#aabaab. The order of babb and abbb is preserved in Y since they are successive, non-sensitive, and with an overlap of k−1 = 3 letters. The order of abaa and aaab, which are successive and non-sensitive, is not preserved since they do not have an overlap of k−1 letters.∎

3. We define the MCSR problem, which seeks to produce a string Z by deleting or replacing all separators in Y with letters from the alphabet Σ of W so that: no sensitive pattern is reinstated in Z; occurrences of spurious patterns that cannot be mined from W but can be mined from Z, for a given support threshold, are prevented; and the distortion incurred by the replacements in Z is bounded. The first requirement preserves privacy and the next two preserve data utility. We show that MCSR is NP-hard and propose a heuristic to attack it. See Section 5.

4. We implemented our combinatorial approach for sanitizing a string (i.e., all aforementioned algorithms implementing the pipeline W → X → Y → Z) and show its effectiveness and efficiency on real and synthetic data. See Section 6.

2 Preliminaries, Problem Statements, and Main Results

Preliminaries. Let W = W[0]W[1]⋯W[n−1] be a string of length |W| = n over a finite ordered alphabet Σ of size |Σ| = σ. By Σ* we denote the set of all strings over Σ, and by Σ^k the set of all length-k strings over Σ. For two positions i and j on W, we denote by W[i..j] = W[i]⋯W[j] the substring of W that starts at position i and ends at position j of W. By ε we denote the empty string of length 0. A prefix of W is a substring of the form W[0..j], and a suffix of W is a substring of the form W[i..n−1]. A proper prefix (suffix) of a string is one that is not equal to the string itself. By Freq_V(U) we denote the number of occurrences of string U in string V. Given two strings U and V, we say that U has a suffix-prefix overlap of length ℓ > 0 with V if and only if the length-ℓ suffix of U is equal to the length-ℓ prefix of V, i.e., U[|U|−ℓ..|U|−1] = V[0..ℓ−1].

We fix a string W of length n over an alphabet Σ and an integer 0 < k < n. We use the terms length-k string and pattern interchangeably. An occurrence of a pattern is uniquely represented by its starting position. Let S be a set of positions over [0, n−k] with the following closure property: for every i ∈ [0, n−k], if there exists j ∈ S such that W[i..i+k−1] = W[j..j+k−1], then i ∈ S. That is, if an occurrence of a pattern is in S, all its occurrences are in S. A substring W[i..i+k−1] of W is called sensitive if and only if i ∈ S. S is thus the set of occurrences of sensitive patterns. The difference set I = [0, n−k] \ S is the set of occurrences of non-sensitive patterns.

For any string U, we denote by I_U the set of occurrences of its non-sensitive length-k substrings over Σ. (We have that I_W = I.) We call an occurrence i the t-predecessor of another occurrence j in I_U if and only if i is the largest element in I_U that is less than j. This relation induces a strict total order on the occurrences in I_U. We call i the p-predecessor of j in I_U if and only if i is the t-predecessor of j in I_U and U[i..i+k−1] has a suffix-prefix overlap of length k−1 with U[j..j+k−1]. This relation induces a strict partial order on the occurrences in I_U. We call a subset J of I_U a t-chain (resp., p-chain) if, for all elements of J except the minimum one, their t-predecessor (resp., p-predecessor) is also in J. For two strings U and V, chains J_U and J_V are equivalent, denoted by J_U ≡ J_V, if and only if |J_U| = |J_V| and U[u_h..u_h+k−1] = V[v_h..v_h+k−1], where u_h is the h-th smallest element of J_U and v_h is the h-th smallest element of J_V, for all h ∈ [1, |J_U|].
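For concreteness, here is a small Python sketch (ours, with a hypothetical name) that partitions the non-sensitive occurrences of a string U into p-chains; replacing the overlap test with True would instead produce the single t-chain:

def p_chains(U, k, occurrences):
    """Group the sorted non-sensitive occurrences into p-chains: an
    occurrence j extends the chain of its t-predecessor prev only if
    the two length-k patterns share a (k-1)-length overlap."""
    chains = []
    prev = None
    for j in sorted(occurrences):
        if prev is not None and U[prev + 1:prev + k] == U[j:j + k - 1]:
            chains[-1].append(j)   # prev is also j's p-predecessor
        else:
            chains.append([j])     # start a new p-chain
        prev = j
    return chains

On Example 2's W this yields [[0, 1], [4, 5, 6, 7, 8, 9], [11]]: abaa (position 1) and aaab (position 4) fall into different p-chains, matching the discussion above.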

Problem Statements and Main Results.

Problem 1 (TFS)

Given W, k, S, and I, construct the shortest string X such that:

C1

X does not contain any sensitive pattern.

P1

I_W ≡ I_X, i.e., the t-chains I_W and I_X are equivalent.

P2

Freq_X(U) = Freq_W(U), for every non-sensitive pattern U ∈ Σ^k.

TFS requires constructing the shortest string X in which all sensitive patterns with occurrences in S are concealed (C1), while preserving the order (P1) and the frequency (P2) of all non-sensitive patterns. Our first result is the following.

Theorem 2.1

Let W be a string of length n over an alphabet Σ. Given k < n and S, TFS-ALGO solves Problem 1 in O(kn) time, which is worst-case optimal. An O(n)-sized representation of X can be built in O(n) time.

P1 implies P2, but P1 is a strong assumption that may result in long output strings that are inefficient to analyze. We thus relax P1 to require that the order of appearance remains the same only for sequences of successive non-sensitive length-k substrings that also overlap by k−1 letters (p-chains).

Problem 2 (PFS)

Given W, k, S, and I, construct a shortest string Y such that:

C1

Y does not contain any sensitive pattern.

Π1

For any p-chain J_W of I_W, there is a p-chain J_Y of I_Y such that J_W ≡ J_Y.

P2

Freq_Y(U) = Freq_W(U), for every non-sensitive pattern U ∈ Σ^k.

Our second result, which builds on Theorem 2.1, is the following.

Theorem 2.2

Let W be a string of length n over an alphabet Σ. Given k < n and S, PFS-ALGO solves Problem 2 in the optimal O(n + |Y|) time.

To arrive at Theorems 2.1 and 2.2, we use a special letter # ∉ Σ (separator) when required. However, the occurrences of # may reveal the locations of sensitive patterns. We thus seek to delete or replace the occurrences of # in Y with letters from Σ. The new string Z should not reinstate any sensitive pattern. Given an integer threshold τ > 0, we call a pattern U ∈ Σ^k a τ-ghost in Z if and only if Freq_W(U) < τ but Freq_Z(U) ≥ τ. Moreover, we seek to prevent τ-ghost occurrences in Z by also bounding the total weight of the letter choices we make to replace the occurrences of #. This is the MCSR problem. We show that already a restricted version of the MCSR problem, namely, the version with pattern length 1, is NP-hard via the Multiple Choice Knapsack (MCK) problem [20].

Theorem 2.3

The MCSR problem is NP-hard.

Based on this connection, we propose a non-trivial heuristic algorithm to attack the MCSR problem for the general case of an arbitrary pattern length k.

3 TFS-ALGO

We convert string W into a string X over alphabet Σ ∪ {#}, # ∉ Σ, by reading the letters of W from left to right and appending them to X while enforcing the following two rules:

R1: When the last letter of a sensitive substring U is read from W, we append # to X (essentially replacing this last letter of U with #). Then we append the longest proper prefix of the succeeding non-sensitive substring (in the t-predecessor order) right after #; the last letter of that substring is appended when the substring itself is processed.

R2: When the k−1 letters before # are the same as the k−1 letters after #, we remove # and the k−1 succeeding letters (inspect Fig. 1).

Figure 1: Sensitive patterns are overlined in red; non-sensitive ones are under- or over-lined in blue; the first string is obtained by applying R1, and X by applying both R1 and R2. In green we highlight an overlap of k−1 letters. Note that substring aaaababbb, whose length is greater than k, also does not occur in X.

R1 prevents sensitive patterns from occurring in X, and R2 reduces the length of X (i.e., it allows protecting sensitive patterns with fewer extra letters). Both rules leave the order and the frequencies of non-sensitive patterns unchanged. It is crucial to observe that applying the idea behind R2 on more than k−1 letters would decrease the frequency of some pattern, while applying it on fewer than k−1 letters would create new patterns. Thus, we need to consider R2 exactly as stated.
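Here is a minimal runnable sketch of rules R1 and R2 (ours, not the authors' C++ implementation); it takes the set S of starting positions of sensitive patterns, assumed closed under pattern equality, instead of the array C introduced below, and reproduces the output of Example 1:

def tfs(W, k, S):
    """Sketch of the TFS construction: R1 inserts '#' plus the next
    non-sensitive pattern; R2 suppresses '#' when the (k-1)-length
    suffix before it equals the (k-1)-length prefix after it."""
    n = len(W)
    nonsens = [i for i in range(n - k + 1) if i not in S]
    if not nonsens:
        return ""                        # every pattern is sensitive
    X = W[nonsens[0]:nonsens[0] + k]     # first non-sensitive pattern
    prev = nonsens[0]
    for j in nonsens[1:]:
        if j == prev + 1:
            X += W[j + k - 1]            # consecutive: append last letter
        elif W[prev + 1:prev + k] == W[j:j + k - 1]:
            X += W[j + k - 1]            # R2: (k-1)-overlap across the
                                         # sensitive run, no separator
        else:
            X += "#" + W[j:j + k]        # R1: separator + next pattern
        prev = j
    return X

print(tfs("aabaaaababbbaab", 4, {2, 3, 10}))  # aabaa#aaababbba#baab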

Let C be a bit array of size n that stores the occurrences of sensitive and non-sensitive patterns: C[i] = 1 if i ∈ S and C[i] = 0 if i ∈ I. For technical reasons, we also fix the last k−1 values of C, which do not correspond to occurrences of length-k patterns. Note that C is constructible from S in O(n) time. Given C and k, TFS-ALGO efficiently constructs X by implementing R1 and R2 concurrently, as opposed to implementing R1 and then R2 (see the proof of Lemma 1 for the details of the workings of TFS-ALGO and Fig. 1 for an example). We next show that string X enjoys several properties.

Lemma 1

Let W be a string of length n over Σ. Given k < n and array C, TFS-ALGO constructs the shortest string X such that the following hold:

  1. There exists no W[i..i+k−1] with i ∈ S occurring in X (C1).

  2. I_W ≡ I_X, i.e., the order of the substrings W[i..i+k−1], for all i ∈ I, is the same in W and in X; conversely, the order of all length-k substrings of X over Σ is the same in X and in W (P1).

  3. Freq_X(U) = Freq_W(U), for every non-sensitive pattern U ∈ Σ^k (P2).

  4. The occurrences of letter # in X are at most one per maximal run of sensitive positions, and any two of them are at least k positions apart (P3).

  5. |X| = O(kn), and these bounds are tight (P4).

Algorithm 1: TFS-ALGO (outline)
1. Initialize X to the empty string; let j be the leftmost position of a non-sensitive pattern.
2. If such a position exists, append the first non-sensitive pattern of W to X.
3. While two consecutive patterns remain to be examined:
   (a) If both are non-sensitive, append the last letter of the leftmost one to X and advance.
   (b) If only the rightmost is sensitive, mark it and advance.
   (c) If both are sensitive, advance.
   (d) If the leftmost is sensitive and the rightmost is not:
       (i) if the last marked sensitive pattern and the current non-sensitive one overlap by k−1 letters, append the last letter of the latter to X (this realizes R2);
       (ii) else, append # followed by the current non-sensitive pattern to X (this realizes R1).
4. Report X.
Proof

Proofs of C1 and P1-P4 can be found in the appendix. We prove here that X has minimal length. Let X' be the string constructed by TFS-ALGO after processing a given prefix of W. We proceed by induction on the length of the processed prefix, claiming that X' is the shortest string such that C1 and P1-P4 hold for that prefix. We call such a string optimal.

Base case: the processed prefix ends where the first non-sensitive pattern of W ends. By the initialization of TFS-ALGO, X' is equal to the first non-sensitive length-k substring of W, and it is clearly the shortest string such that C1 and P1-P4 hold.

Inductive hypothesis and step: X' is optimal after processing some prefix of W. If the next pattern is sensitive, X' is unchanged, and thus it is still optimal. If the next pattern is non-sensitive and TFS-ALGO appends only its last letter, the result is again evidently optimal. Otherwise, TFS-ALGO appends # followed by the next non-sensitive pattern. Suppose by contradiction that there exists a shorter string for which C1 and P1-P4 still hold: it must either drop # or append fewer than k letters after #. If fewer than k letters were appended after #, then, since TFS-ALGO never reads this part of W again, P2-P3 would be violated, as an occurrence of a non-sensitive pattern would be missed. Without #, the last letters appended would violate either C1, or P1 and P2. Then X' is optimal.∎

Theorem 2.1 (restated)

Proof

For the first part, inspect TFS-ALGO. The initialization can be realized in O(n) time. The while loop is executed no more than n times, and every operation inside the loop takes O(1) time, except for the steps that append a full pattern, which take O(k) time. Correctness and optimality follow directly from Lemma 1 (P4).

For the second part, we assume that X is represented by W and a sequence of pointers to W interleaved (if necessary) by occurrences of #. In the initialization, we can use an interval of positions over W to represent the length-k substring of W added to X. In all other steps we can use O(1) additional space, as one letter is added to X per one letter of W. By Lemma 1 (P3), the number of occurrences of letter # is bounded. The overlap check can be implemented in constant time after a linear-time pre-processing of W for longest common extension queries [9]. All other operations take in total linear time in n. Thus there exists an O(n)-sized representation of X, and it is constructible in O(n) time.∎

4 PFS-ALGO

Lemma 1 tells us that X is the shortest string satisfying constraint C1 and properties P1-P4. If we were to drop P1 and employ the partial-order property Π1 (see Problem 2), the length of X would not always be minimal: if a permutation of the #-free substrings of X contains consecutive pairs with a suffix-prefix overlap of length k−1, we may further apply R2, obtaining a shorter string while still satisfying Π1.

To find such a permutation efficiently and construct a shorter string Y from X, we propose PFS-ALGO. The crux of our algorithm is an efficient method to solve a variant of the classic NP-complete Shortest Common Superstring (SCS) problem [11]. Specifically, our algorithm: (I) computes the string X using Theorem 2.1; (II) constructs a collection B' of strings, each of two symbols (two identifiers): the first (resp., second) symbol of the h-th element of B' is a unique identifier of the string corresponding to the (k−1)-length prefix (resp., suffix) of the h-th #-free substring of X; (III) computes a shortest string containing every element of B' as a distinct substring; (IV) constructs Y by mapping each element back to its distinct substring of X. If there are multiple possible shortest strings, one is selected arbitrarily.
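Step (II) amounts to renaming each #-free block of X by the ranks of its length-(k−1) prefix and suffix. A minimal sketch (the function name is ours), applied to the blocks of Example 1's X with k = 4:

def rename_blocks(blocks, k):
    """Map each #-free block to (rank of (k-1)-prefix, rank of (k-1)-suffix);
    equal affixes get equal ranks, so a shared rank marks a possible
    (k-1)-merge of two blocks."""
    affixes = sorted({b[:k - 1] for b in blocks} |
                     {b[-(k - 1):] for b in blocks})
    rank = {s: r for r, s in enumerate(affixes)}
    return [(rank[b[:k - 1]], rank[b[-(k - 1):]]) for b in blocks]

print(rename_blocks(["aabaa", "aaababbba", "baab"], 4))
# [(1, 2), (0, 3), (2, 1)]: the suffix rank of "aabaa" equals the prefix
# rank of "baab", so the two can be merged into "aabaab", as in Example 2.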

Example 3 (Illustration of the workings of PFS-ALGO)

Let X be as computed by TFS-ALGO. The collection B consists of the #-free substrings of X, and the collection B' consists of the corresponding two-letter strings (id of prefix, id of suffix). A shortest string containing all elements of B' as distinct substrings is obtained by permuting the elements of B and then applying R2 twice. This shortest string is mapped back to the solution Y. For example, the element 13 is mapped back to aaabac. Note that Y contains two occurrences of # and is shorter than X, which contains more occurrences of #.∎

We now present the details of PFS-ALGO. We first introduce the Fixed-Overlap Shortest String with Multiplicities (FO-SSM) problem: given a collection B of strings B_1, …, B_|B| and an integer ℓ, with |B_h| > ℓ for all h, FO-SSM seeks to find a shortest string containing each element of B as a distinct substring using the following operations on any pair of strings B_i, B_j:

  1. concat(B_i, B_j) = B_i B_j;

  2. ℓ-merge(B_i, B_j) = B_i[0..|B_i|−1−ℓ] B_j, defined when the length-ℓ suffix of B_i equals the length-ℓ prefix of B_j.

Any solution to FO-SSM with B being the collection of the #-free substrings of X and ℓ = k−1 implies a solution to the PFS problem, because |B_h| > k−1 for all B_h (see Lemma 1, P3).

The FO-SSM problem is a variant of the SCS problem. In the SCS problem, we are given a set of strings and we are asked to compute the shortest common superstring of the elements of this set. The SCS problem is known to be NP-complete, even for binary strings [11]. However, if all strings are of length two, the SCS problem admits a linear-time solution [11]. We exploit this crucial detail to obtain a linear-time solution to the FO-SSM problem in Lemma 3. In order to arrive at this result, we first adapt the linear-time SCS solution of [11] to our needs (see Lemma 2) and then plug this solution into Lemma 3.

Lemma 2

Let Q be a collection of strings, each of length two, over an alphabet Σ. We can compute a shortest string containing every element of Q as a distinct substring in linear time.

Proof

We sort the elements of Q lexicographically in linear time using radixsort. We also replace every letter in these strings with its lexicographic rank, again in linear time using radixsort. In linear time we then construct the de Bruijn multigraph G of these strings [6]. Within the same time complexity, we find all nodes of G whose in-degree, denoted by IN, is smaller than their out-degree, denoted by OUT. We perform the following two steps:

Step 1: While there exists a node v in G with IN(v) < OUT(v), we start an arbitrary path (with possibly repeated nodes) from v, traverse consecutive edges and delete them. Each time we delete an edge, we update the in- and out-degree of the affected nodes. We stop traversing edges when a node with no remaining out-edges is reached: whenever its in- and out-degree both become zero, we also delete it from G. Then, we add the traversed path to a set P of paths. The path can contain the same node more than once. If G is empty we halt. Proceeding this way, no path in P ends at the node where another path in P starts; thus this path decomposition is minimal. If G is not empty at the end, by construction, it consists only of cycles.
Step 2: While G is not empty, we perform the following. If there exists a cycle that intersects any path in P, we splice the cycle with the path, update the path with the result of the splicing, and delete the cycle from G. This operation can be efficiently implemented by maintaining an array A of size σ of linked lists over the paths in P: the entry of A for a letter stores a list of pointers to all occurrences of that letter in the elements of P. Thus in constant time per node of the cycle we check whether any such path exists in P and splice the two in this case. If no such path exists in P, we add to P any of the path-linearizations of the cycle, and delete the cycle from G. After each change to P, we update A and delete from G every node whose in- and out-degree are both zero.

The correctness of this algorithm follows from the fact that P is a minimal path decomposition of G. Thus any concatenation of the paths in P represents a shortest string containing all elements of Q as distinct substrings.∎
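A compact Python sketch of the two steps (ours; it favors clarity over the linear-time bookkeeping with the array of linked lists described above):

from collections import Counter, defaultdict, deque

def shortest_string_of_pairs(pairs):
    """Shortest string containing each two-letter string in `pairs` as a
    distinct substring: each pair a+b is an edge a->b of a de Bruijn-style
    multigraph; extract a minimal path decomposition, then splice cycles."""
    out, deg = defaultdict(deque), Counter()   # deg = out-degree - in-degree
    for a, b in pairs:
        out[a].append(b)
        deg[a] += 1
        deg[b] -= 1

    def walk(v):                               # consume a maximal path from v
        path = [v]
        while out[v]:
            v = out[v].popleft()
            path.append(v)
        return path

    # Step 1: one path per unit of out-degree surplus
    paths = [walk(v) for v in list(out) for _ in range(max(0, deg[v]))]
    # Step 2: leftover edges form cycles; splice each into a path sharing
    # a node, else keep one of its linearizations as a path
    for v in list(out):
        while out[v]:
            cycle = walk(v)                    # starts and ends at v
            for p in paths:
                if v in p:
                    i = p.index(v)
                    p[i:i] = cycle[:-1]        # splice the cycle at node v
                    break
            else:
                paths.append(cycle)
    return "".join("".join(p) for p in paths)

print(shortest_string_of_pairs(["ab", "bc", "ca", "cd"]))  # e.g. "cabcd"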

Omitted proofs of Lemmas 3 and 4 can be found in the appendix.

Lemma 3

Let B be a collection of strings over an alphabet Σ. Given an integer ℓ, the FO-SSM problem for B can be solved in time linear in the total length of the strings of B.

Thus, PFS-ALGO applies Lemma 3 on the collection of #-free substrings of X with ℓ = k−1 (recall that each such substring has length greater than k−1). Note that each time the concat operation is performed, it also places the letter # between the two strings.

Lemma 4

Let W be a string of length n over an alphabet Σ. Given k < n and array C, PFS-ALGO constructs a shortest string Y satisfying C1, Π1, and P2-P4.

Theorem 2.2 (restated)

Proof

We compute the O(n)-sized representation of string X with respect to W, as described in the proof of Theorem 2.1. This can be done in O(n) time. If X contains no #, then we construct and return Y = X in O(n) time from the representation. If X contains at least one #, implying |Y| ≤ |X|, we compute the LCP data structure of string W in O(n) time [9]; and we implement Lemma 3 in O(n + |Y|) time by avoiding to read string X explicitly: we rather rename the blocks to a collection of two-letter strings by employing the LCP information of W directly. We then construct and report Y in O(n + |Y|) time. Correctness follows directly from Lemma 4.∎

5 The MCSR Problem and MCSR-ALGO

The strings X and Y, constructed by TFS-ALGO and PFS-ALGO, respectively, may contain the separator #, which reveals information about the location of the sensitive patterns in W. Specifically, a malicious data recipient can go to the position of a # in X and “undo” Rule R1 that has been applied by TFS-ALGO, removing # and the k−1 letters after it from X. The result will be an occurrence of the sensitive pattern. For example, applying this process to the first # in the string X shown in Fig. 1 results in recovering the sensitive pattern abab. A similar attack is possible on the string Y produced by PFS-ALGO, although it is hampered by the fact that substrings within two consecutive #s in X often swap places in Y.

To address this issue, we seek to construct a new string Z in which the occurrences of # are either deleted or replaced by letters from Σ. To preserve privacy, we require separator replacements not to reinstate sensitive patterns. To preserve data utility, we favor separator replacements that have a small cost in terms of occurrences of τ-ghosts (patterns with frequency less than τ in W and at least τ in Z) and that incur a bounded level of distortion in Z, as defined below. This is the MCSR problem, a restricted version of which is presented in Problem 3. The restricted version differs from MCSR in that it uses 1 for the pattern length instead of an arbitrary value k. It is presented next for simplicity and because it is used in the proof of Lemma 5 (see the appendix for the proof). Lemma 5 implies Theorem 2.3.

Problem 3 (MCSR with pattern length 1)

Given a string Y over an alphabet Σ ∪ {#} with δ > 0 occurrences of letter #, and parameters τ and θ, construct a new string Z by substituting the δ occurrences of # in Y with letters from Σ, such that:

(I) the total Ghost cost of the τ-ghost occurrences created in Z is minimum, and (II) the total Sub weight of the letters used in the substitutions is at most θ.

The cost of τ-ghosts is captured by a function Ghost. This function assigns a cost to an occurrence of a τ-ghost caused by a separator replacement at position i, and it is specified based on domain knowledge. For example, with a cost equal to τ − Freq_W(U) for each gained occurrence of a τ-ghost U, we penalize more heavily a τ-ghost with frequency much below τ in W, and the penalty increases with the number of gained occurrences. Moreover, we may want to penalize positions towards the end of a temporally ordered string, to avoid spurious patterns that would be deemed important in applications based on time-decaying models [8].
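For illustration, one hypothetical instantiation along these lines (our example; the paper leaves the function to the domain expert):

def ghost(i, freq_W_U, gained, tau, n):
    """Cost of turning pattern U into a tau-ghost via a replacement at
    position i: each gained occurrence costs more the further Freq_W(U)
    lies below tau, scaled up for later (more recent) positions."""
    return gained * (tau - freq_W_U) * (1 + i / n)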

The replacement distortion is captured by a function Sub, which assigns a weight to a letter that could replace a # and which is specified based on domain knowledge. The maximum allowable replacement distortion is θ. Small weights favor the replacement of separators with desirable letters (e.g., letters that reinstate non-sensitive frequent patterns), while letters that reinstate sensitive patterns are assigned a weight larger than θ, which prohibits them from replacing a #. Similarly, weights larger than θ are assigned to letters that would lead to implausible patterns [14] if they replaced occurrences of #.

Lemma 5

The MCSR problem with pattern length 1 is NP-hard.

Theorem 2.3 follows directly from Lemma 5.

MCSR-ALGO. Our MCSR-ALGO is a non-trivial heuristic that exploits the connection between the MCSR and MCK [20] problems and works by:
(I) Constructing the set of all candidate τ-ghost patterns (i.e., length-k strings over Σ with frequency below τ in W that can reach frequency at least τ in Z).
(II) Creating an instance of MCK from an instance of MCSR. For this, we map the i-th occurrence of # to a class i in MCK and each possible replacement of the occurrence with a letter to a different item in that class. Specifically, we consider all possible replacements with letters in Σ and also a replacement with the empty string, which models deleting (instead of replacing) the i-th occurrence of #. In addition, we set the costs and weights that are input to MCK as follows. The cost of replacing the i-th occurrence of # with letter j is set to the sum of the Ghost function over all candidate τ-ghost patterns when the i-th occurrence of # is replaced by j. That is, we make the worst-case assumption that the replacement forces all candidate τ-ghosts to become τ-ghosts in Z. The weight of replacing the i-th occurrence of # with letter j is set to the weight that Sub assigns to j.
(III) Solving the instance of MCK and translating the solution back to a (possibly suboptimal) solution of the MCSR problem. For this, we replace the i-th occurrence of # with the letter corresponding to the item chosen by the MCK algorithm from class i, and similarly for each other occurrence of #. If the instance has no solution (i.e., no possible replacement can hide the sensitive patterns), MCSR-ALGO reports that Z cannot be constructed and terminates.
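A minimal sketch of steps (II)-(III), with a simple pseudo-polynomial dynamic program over the weight budget standing in for the MCK solver of [20] (data layout and names are ours; integer weights assumed):

def solve_mck(costs, weights, theta):
    """costs[i][j], weights[i][j]: cost/weight of item j in class i
    (one class per occurrence of '#', one item per candidate letter or
    deletion). Picks one item per class, minimizing total cost subject
    to total weight <= theta; returns None if infeasible."""
    dp = {0: (0, [])}                      # weight -> (best cost, picks)
    for cls_cost, cls_weight in zip(costs, weights):
        ndp = {}
        for w, (c, picks) in dp.items():
            for j, (cj, wj) in enumerate(zip(cls_cost, cls_weight)):
                nw = w + wj
                if nw > theta:
                    continue               # letter too distorting
                if nw not in ndp or c + cj < ndp[nw][0]:
                    ndp[nw] = (c + cj, picks + [j])
        dp = ndp
        if not dp:
            return None                    # Z cannot be constructed
    return min(dp.values(), key=lambda t: t[0])[1]

Replacing the i-th # with the letter of the i-th returned item (or deleting it) then yields Z.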

Lemma 6 below states the running time of MCSR-ALGO (see the appendix for the proof and an efficient implementation of this algorithm).

Lemma 6

MCSR-ALGO runs in linear time plus the running time of the MCK algorithm on an instance with one class per occurrence of # and σ + 1 items per class.

6 Experimental Evaluation

We evaluate our approach, referred to as TPM, in terms of data utility and efficiency. Given a string W over Σ, TPM sanitizes W by applying TFS-ALGO, PFS-ALGO, and then MCSR-ALGO, which uses the algorithm of [20] for solving the MCK instances. The final output is a string Z over Σ.

Experimental Setup and Data. We do not compare TPM against existing methods, because they are not alternatives to our approach (see Section 7). Instead, we compared it against a greedy baseline referred to as BA.

BA initializes its output string to W and then considers each occurrence of a sensitive pattern, from left to right. For each such occurrence, it replaces the letter of the occurrence that has the largest frequency in the current string with another letter that is not contained in the occurrence and has the smallest frequency in the current string, breaking all ties arbitrarily. If no such letter exists, the letter is replaced by # to ensure that a solution is produced (even if it may reveal the location of a sensitive pattern). Each replacement removes the occurrence of the sensitive pattern and aims to prevent τ-ghost occurrences by selecting a letter that will not substantially increase the frequency of patterns overlapping with the occurrence.
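A sketch of BA as described above (our hypothetical helper, not the original implementation):

from collections import Counter

def ba(W, k, sensitive):
    """Greedy baseline: scan for sensitive patterns left to right; in each
    one, replace its most frequent letter by the least frequent letter of
    the current string not contained in the pattern (or by '#')."""
    Z = list(W)
    for i in range(len(Z) - k + 1):
        pat = "".join(Z[i:i + k])
        if pat in sensitive:
            freq = Counter(Z)
            pos = max(range(i, i + k), key=lambda p: freq[Z[p]])
            cand = [c for c in freq if c not in pat and c != "#"]
            Z[pos] = min(cand, key=lambda c: freq[c]) if cand else "#"
    return "".join(Z)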

We considered the following publicly available datasets used in [1, 12, 14, 16]: Oldenburg (OLD), Trucks (TRU), MSNBC (MSN), the complete genome of Escherichia coli (DNA), and synthetic data (uniformly random strings, the largest of which is referred to as SYN). See Table 1 for the characteristics of these datasets and the parameter values used in experiments, unless stated otherwise.

Dataset  Data domain     Length      Alphabet size   # sensitive patterns   # sensitive positions   Pattern length
OLD      Movement        85,563      100             …                      …                       …
TRU      Transportation  5,763       100             …                      …                       …
MSN      Web             4,698,764   17              …                      …                       …
DNA      Genomic         4,641,652   4               …                      …                       …
SYN      Synthetic       20,000,000  10              …                      …                       …
Table 1: Characteristics of datasets and values used (default values are in bold).

The sensitive patterns were selected randomly among the frequent length-k substrings at minimum support τ, following [12, 14, 16]. We used fairly low values of τ for TRU, OLD, MSN, and DNA, to have a wider selection of sensitive patterns. We also used a uniform cost for every occurrence of each τ-ghost and a uniform weight for each letter replacement that does not create a sensitive pattern (replacements that do create one were assigned a weight exceeding θ). This setup treats all candidate τ-ghost patterns and all candidate letters for replacement uniformly, to facilitate a fair comparison with BA, which cannot distinguish between τ-ghost candidates or favor specific letters.

To capture the utility of sanitized data, we used a (frequency) distortion measure that aggregates the difference between Freq_W(U) and Freq_Z(U) over all non-sensitive patterns U. The distortion measure quantifies changes in the frequency of non-sensitive patterns, with low values suggesting that Z remains useful for tasks based on pattern frequency (e.g., identifying motifs corresponding to functional or conserved DNA [21]). We also measured the number of τ-ghost and τ-lost patterns in Z, following [12, 14, 16], where a pattern U is τ-lost in Z if and only if Freq_W(U) ≥ τ but Freq_Z(U) < τ. That is, τ-lost patterns model knowledge that can no longer be mined from Z but could be mined from W, whereas τ-ghost patterns model knowledge that can be mined from Z but not from W. A small number of τ-lost/ghost patterns suggests that frequent pattern mining can be accurately performed on Z [12, 14, 16]. Unlike BA, by design TPM does not incur any τ-lost pattern, as TFS-ALGO and PFS-ALGO preserve the frequencies of non-sensitive patterns, and MCSR-ALGO can only increase pattern frequencies.
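Both measures reduce to comparing two frequency tables. A minimal sketch (ours; the squared-difference aggregation is one plausible instantiation of the distortion measure, not necessarily the paper's):

def utility_report(freq_W, freq_Z, tau):
    """Distortion aggregates frequency changes of non-sensitive patterns;
    tau-lost/ghost patterns cross the mining threshold in one direction."""
    patterns = set(freq_W) | set(freq_Z)
    f = lambda d, u: d.get(u, 0)
    distortion = sum((f(freq_W, u) - f(freq_Z, u)) ** 2 for u in patterns)
    lost  = [u for u in patterns if f(freq_W, u) >= tau > f(freq_Z, u)]
    ghost = [u for u in patterns if f(freq_Z, u) >= tau > f(freq_W, u)]
    return distortion, lost, ghost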

All experiments ran on an Intel Xeon E5-2640 at 2.66GHz with 16GB RAM. Our source code, written in C++, is available at https://bitbucket.org/stringsanitization. The results have been averaged over repeated runs.

Figure 2 (panels (a) OLD, (b) TRU, (c) MSN, (d) DNA): Distortion vs. number of sensitive patterns and their total number of occurrences in W (first two lines of the x axis).
Figure 3 (panels (a) OLD, (b) TRU, (c) MSN, (d) DNA): Distortion vs. length of sensitive patterns.

Data Utility. We first demonstrate that TPM incurs very low distortion, which implies high utility for tasks based on the frequency of patterns (e.g., [21]). Fig. 2 shows that, for a varying number of sensitive patterns, TPM incurred substantially lower distortion than BA over all experiments. Also, Fig. 2 shows that TPM remains effective even in challenging settings with many sensitive patterns (e.g., the last point in Fig. 2b, where a large fraction of the positions in W are sensitive). Fig. 3 shows that, for varying k, TPM again caused substantially lower distortion than BA over all experiments.

Next, we demonstrate that TPM permits accurate frequent pattern mining. Fig. 4 shows that TPM led to no τ-lost or τ-ghost patterns for the TRU and MSN datasets. This implies no utility loss for mining frequent length-k substrings with threshold τ. In all other cases, the number of τ-ghosts for TPM was on average much smaller than the total number of τ-lost and τ-ghost patterns for BA. BA performed poorly, with a large fraction of the frequent patterns becoming τ-lost for TRU and for DNA. Fig. 5 shows that, for varying k, TPM led to far fewer τ-lost/ghost patterns than BA, which again performed poorly on DNA.

Figure 4 (panels (a) OLD, (b) TRU, (c) MSN, (d) DNA): Total number of τ-lost and τ-ghost patterns vs. number of sensitive patterns. The annotation on top of each bar for BA breaks the total down into τ-lost and τ-ghost patterns.
Figure 5 (panels (a) OLD, (b) TRU, (c) MSN, (d) DNA): Total number of τ-lost and τ-ghost patterns vs. length of sensitive patterns. The annotation on top of each bar for BA breaks the total down into τ-lost and τ-ghost patterns.

We also demonstrate that PFS-ALGO substantially reduces the length of the output string of TFS-ALGO, creating a string that contains less redundant information and allows for more efficient analysis. Fig. 6a shows the lengths of X and Y and their difference for DNA. Y was much shorter than X, and its length decreased as the number of sensitive patterns grew, since more substrings had a suffix-prefix overlap of length k−1 and were merged (see Section 4). Interestingly, the length of Y was close to that of W (the string before sanitization). A larger k led to a less substantial length reduction, as shown in Fig. 6b (but still a few thousand letters were removed), since it is less likely for long substrings of sensitive patterns to have an overlap and be merged.

Figure 6 (panels (a) DNA, (b) DNA, (c) substrings of SYN, (d) SYN): Length of X and Y (output of TFS-ALGO and PFS-ALGO, resp.) for varying: (a) number of sensitive patterns, (b) length of sensitive patterns; on top of each pair of bars we plot the length difference. Runtime on synthetic data for varying: (c) length of the string and (d) length of sensitive patterns.

Efficiency. We finally measured the runtime of TPM using prefixes of the synthetic string SYN, whose length is 20 million letters. Fig. 6c (resp., Fig. 6d) shows that TPM scaled linearly with n (resp., k), as predicted by our analysis in Section 5, since the algorithm of [20] was used for the MCK instances. In addition, TPM is efficient, with a runtime similar to that of BA, even for the 20-million-letter SYN string.

7 Related Work

Data sanitization (a.k.a. knowledge hiding) aims at concealing patterns modeling confidential knowledge by limiting their frequency, so that they are not easily mined from the data. Existing methods are applied to: (I) a collection of set-valued data (transactions) [24] or spatiotemporal data (trajectories) [1]; (II) a collection of sequences [12, 14]; or (III) a single sequence [4, 16, 25]. Yet, none of these methods follows our CSD setting: methods in category I are not applicable to string data, and those in categories II and III offer no guarantees on privacy-related constraints [25] or on utility-related properties [4, 12, 14, 16]. Specifically, unlike our approach, [25] cannot guarantee that all sensitive patterns are concealed (constraint C1), while [4, 12, 14, 16] do not guarantee the satisfaction of utility properties (e.g., Π1 and P2).

Anonymization aims to prevent the disclosure of individuals’ identities and/or of information that individuals are not willing to be associated with [3, 10]. Anonymization works such as [3, 5, 7] are thus not alternatives to our work (see the appendix).

8 Conclusion

In this paper, we introduced the Combinatorial String Dissemination model. The focus of this model is on guaranteeing privacy-utility trade-offs (e.g., C1 vs. Π1 and P2). We defined a problem (TFS) which seeks to produce the shortest string that preserves the order of appearance and the frequency of all non-sensitive patterns, and a variant (PFS) that preserves a partial order and the frequency of the non-sensitive patterns but produces a shorter string. We developed two time-optimal algorithms, TFS-ALGO and PFS-ALGO, for the problem and its variant, respectively. We also developed MCSR-ALGO, a heuristic that prevents the disclosure of the location of sensitive patterns from the outputs of TFS-ALGO and PFS-ALGO. Our experiments show that sanitizing a string by TFS-ALGO, PFS-ALGO and then MCSR-ALGO is effective and efficient.

Acknowledgments.

HC is supported by a CSC scholarship. GR and NP are partially supported by MIUR-SIR project CMACBioSeq grant n. RBSI146R5L. We acknowledge the use of the Rosalind HPC cluster hosted by King’s College London.

References

  • [1] Abul, O., Bonchi, F., Giannotti, F.: Hiding sequential and spatiotemporal patterns. TKDE 22(12), 1709–1723 (2010)
  • [2] Aggarwal, C.C., Yu, P.S.: On anonymization of string data. In: SDM. pp. 419–424 (2007)
  • [3] Aggarwal, C.C., Yu, P.S.: A framework for condensation-based anonymization of string data. DMKD 16(3), 251–275 (2008)
  • [4] Bonomi, L., Fan, L., Jin, H.: An information-theoretic approach to individual sequential data sanitization. In: WSDM. pp. 337–346 (2016)
  • [5] Bonomi, L., Xiong, L.: A two-phase algorithm for mining sequential patterns with differential privacy. In: CIKM. pp. 269–278 (2013)
  • [6] Cazaux, B., Lecroq, T., Rivals, E.: Linking indexing data structures to de Bruijn graphs: Construction and update. J. Comput. Syst. Sci. (2016)
  • [7] Chen, R., Acs, G., Castelluccia, C.: Differentially private sequential data publication via variable-length n-grams. In: CCS. pp. 638–649 (2012)

  • [8] Cormode, G., Korn, F., Tirthapura, S.: Exponentially decayed aggregates on data streams. In: ICDE. pp. 1379–1381 (2008)
  • [9] Crochemore, M., Hancart, C., Lecroq, T.: Algorithms on strings. Cambridge University Press (2007)
  • [10] Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: TCC. pp. 265–284 (2006)
  • [11] Gallant, J., Maier, D., Storer, J.A.: On finding minimal length superstrings. J. Comput. Syst. Sci. 20(1), 50–58 (1980)
  • [12] Gkoulalas-Divanis, A., Loukides, G.: Revisiting sequential pattern hiding to enhance utility. In: KDD. pp. 1316–1324 (2011)
  • [13] Grossi, R., Iliopoulos, C.S., Mercas, R., Pisanti, N., Pissis, S.P., Retha, A., Vayani, F.: Circular sequence comparison: algorithms and applications. AMB 11,  12 (2016)
  • [14] Gwadera, R., Gkoulalas-Divanis, A., Loukides, G.: Permutation-based sequential pattern hiding. In: ICDM. pp. 241–250 (2013)
  • [15] Liu, A., Zheng, K., Li, L., Liu, G., Zhao, L., Zhou, X.: Efficient secure similarity computation on encrypted trajectory data. In: ICDE. pp. 66–77 (2015)
  • [16] Loukides, G., Gwadera, R.: Optimal event sequence sanitization. In: SDM. pp. 775–783 (2015)
  • [17] Malin, B., Sweeney, L.: Determining the identifiability of DNA database entries. In: AMIA. pp. 537–541 (2000)
  • [18] Monreale, A., Pedreschi, D., Pensa, R.G., Pinelli, F.: Anonymity preserving sequential pattern mining. Artif. Intell. Law 22(2), 141–173 (2014)
  • [19] Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets. In: S&P. pp. 111–125 (2008)
  • [20] Pisinger, D.: A minimal algorithm for the multiple-choice knapsack problem. Eur J Oper Res 83(2), 394–410 (1995)
  • [21] Pissis, S.P.: MoTeX-II: structured MoTif eXtraction from large-scale datasets. BMC Bioinformatics 15,  235 (2014)
  • [22] Sinha, P., Zoltners, A.A.: The multiple-choice knapsack problem. Operations Research 27(3), 503–515 (1979)
  • [23] Theodorakopoulos, G., Shokri, R., Troncoso, C., Hubaux, J., Boudec, J.L.: Prolonging the hide-and-seek game: Optimal trajectory privacy for location-based services. In: WPES. pp. 73–82 (2014)
  • [24] Verykios, V.S., Elmagarmid, A.K., Bertino, E., Saygin, Y., Dasseni, E.: Association rule hiding. TKDE 16(4), 434–447 (2004)
  • [25] Wang, D., He, Y., Rundensteiner, E., Naughton, J.F.: Utility-maximizing event stream suppression. In: SIGMOD. pp. 589–600 (2013)

Appendix 0.A Omitted Proofs

Proof (Lemma 1)

The index in TFS-ALGO runs over the positions of string W; at any moment it indicates the ending position of the currently considered length-k substring of W. When this substring is sensitive, TFS-ALGO never appends its last letter, implying that, by construction of X, no W[i..i+k−1] with i ∈ S occurs in X.

When two consecutive substrings are non-sensitive, TFS-ALGO appends the last letter of the rightmost one to X, so the order of the two is clearly preserved. When the leftmost is sensitive and the rightmost is not, an auxiliary index stores the starting position on W of the (k−1)-length suffix of the last non-sensitive substring appended to X (see also Fig. 1). C1 ensures that no sensitive substring is added to X in this case, nor when both substrings are sensitive. The next letter will thus be appended to X when a non-sensitive substring is encountered again. The overlap condition is satisfied if and only if the last non-sensitive length-k substring appended to X overlaps the immediately succeeding non-sensitive one by k−1 letters: in this case, the last letter of the latter is appended to X, clearly maintaining the order of the two. Otherwise, # and the succeeding non-sensitive substring are appended to X, once again maintaining the length-k substrings’ order. Conversely, by construction, a length-k substring over Σ occurs in X only if it equals a length-k non-sensitive substring of W. The only occasion when a letter from W is appended to X more than once is when # is introduced: it is easy to see that in this case, because of the occurrence of #, each of the repeated letters creates exactly one occurrence of a non-sensitive pattern, without introducing any new length-k string over Σ nor increasing the occurrences of a previous one. Finally, the overlap case does not introduce any new length-k substring over Σ, nor any extra occurrence of one, because it is only applied when two consecutive non-sensitive length-k substrings of W overlap by exactly k−1 letters.

(P2) It follows from the proofs of C1 and P1.

(P3) Letter # is appended to X only when the leftmost of two consecutive patterns is sensitive, the rightmost is not, and the two do not overlap by k−1 letters. This can happen at most once per maximal run of sensitive positions, since array C can only alternate between runs of 0s and 1s in its first n−k+1 positions. By construction, X cannot start with #, which bounds the maximal number of occurrences of #. Also by construction, each letter # in X is followed by at least k letters: the leftmost non-sensitive substring following a sequence of one or more occurrences of sensitive substrings in W.

(P4) Upper bound. TFS-ALGO increases the length of string X by more than one letter only when letter # is appended to X; each such event adds at most k+1 letters to X. Thus the length of X is maximized when the maximal number of occurrences of # is attained, and it is therefore bounded by O(kn).

Tightness. For the lower bound, let every length-k substring of W be sensitive. Then no element of C is set to 0, the first non-sensitive pattern is never found, and TFS-ALGO outputs the empty string. A de Bruijn sequence of order k over an alphabet Σ is a string in which every possible length-k string over Σ occurs exactly once as a substring. For the upper bound, let W be an order-k de Bruijn sequence over Σ and let the values of C alternate between 0 and 1 in the first n−k+1 positions. The first non-sensitive pattern contributes the first k letters of W to X. Then the alternating values of C trigger, in turn, the case that marks a sensitive pattern and the case that appends # followed by a non-sensitive pattern; the overlap condition always fails because of the de Bruijn sequence property. We thus obtain the sequence of the non-sensitive length-k substrings of W interleaved by occurrences of # appended to X. TFS-ALGO thus outputs a string of length Θ(kn) (see Example 4).

Example 4 (Illustration of P3)

Let Σ be a binary alphabet. We construct an order-k de Bruijn sequence over Σ and mark the occurrences of its length-k substrings as sensitive and non-sensitive alternately. TFS-ALGO then constructs a string consisting of all the non-sensitive length-k substrings interleaved with occurrences of #. The upper bound of Θ(kn) on the length of X is attained.∎

Proof (Lemma 3)

Consider the following renaming technique. Each length-ℓ substring of the collection B is assigned a lexicographic rank. Each string in B is converted to a two-letter string as follows: the first letter is the lexicographic rank of its length-ℓ prefix, and the second letter is the lexicographic rank of its length-ℓ suffix. We thus obtain a new collection B' of two-letter strings. Computing the ranks for all length-ℓ substrings in B can be implemented in linear time by employing radixsort to sort Σ and then the well-known LCP data structure over the concatenation of the strings in B [9]. The FO-SSM problem is thus solved by finding a shortest string containing every element of B' as a distinct substring. Since B' consists of two-letter strings only, we can solve the problem in linear time by applying Lemma 2. The statement follows.∎

Proof (Lemma 4)

C1 and P2 hold trivially for Y, as no length-k substring over Σ is added to or removed from X. The order of the non-sensitive length-k substrings within each #-free block of X is preserved in Y. Thus, for any p-chain of I_X there is an equivalent p-chain of I_Y (Π1 is preserved). P3 also holds trivially for Y, as no occurrence of # is added. Since |Y| ≤ |X|, for P4 it suffices to note that the construction of W in the proof of tightness in Lemma 1 (see also Example 4) ensures that there is no suffix-prefix overlap of length k−1 between any pair of length-k substrings of W over Σ, due to the property of the order-k de Bruijn sequence. Thus the upper bound of Θ(kn) on the length of X is also tight for Y.

The minimality of the length of Y follows from the minimality of |X| and the correctness of Lemma 3, which computes a shortest such string.∎

Proof (Lemma 5)

We reduce the NP-hard Multiple Choice Knapsack (MCK) problem [22] to the MCSR problem with pattern length 1 in polynomial time. In MCK, we are given a set of elements subdivided into m mutually exclusive classes C_1, …, C_m, and a knapsack. Each class C_i has |C_i| elements. Each element j ∈ C_i has an arbitrary cost c_ij ≥ 0 and an arbitrary weight w_ij. The goal is to minimize the total cost (Eq. 1) by filling the knapsack with one element from each class (constraint II), such that the weights of the elements in the knapsack satisfy constraint I, where the constant b represents the minimum allowable total weight of the elements in the knapsack:

minimize Σ_{i=1}^{m} Σ_{j ∈ C_i} c_ij · x_ij    (1)

subject to the constraints: (I) Σ_{i=1}^{m} Σ_{j ∈ C_i} w_ij · x_ij ≥ b, (II) Σ_{j ∈ C_i} x_ij = 1 for each i ∈ [1, m], and (III) x_ij ∈ {0, 1} for each i ∈ [1, m] and each j ∈ C_i.

The variable x_ij takes value 1 if the element j is chosen from class C_i, and 0 otherwise (constraint III). We reduce any instance of MCK to an instance of our problem in polynomial time, as follows:

  • The alphabet Σ consists of one letter a_ij for each element j of each class C_i, i ∈ [1, m].

  • We set Y so that every letter of Σ occurs in it exactly once and letter # occurs m times. For convenience, let us denote by #_i the i-th occurrence of # in Y.

  • We set the parameters τ and θ according to the MCK instance.

  • We set the Ghost cost of replacing #_i with letter a_ij based on c_ij, and the Sub weight of a_ij based on w_ij. The functions are otherwise not defined.

This is clearly a polynomial-time reduction. We now prove the correspondence between a solution to the given MCK instance and a solution to the constructed MCSR instance.

We first show that if a set of chosen elements is a solution to the MCK instance, then the corresponding set of separator replacements is a solution to the MCSR instance. Since the elements in the solution have minimum total cost,