1 Introduction
A large number of applications, in domains ranging from transportation to web analytics and bioinformatics, feature data modeled as strings, i.e., sequences of letters over some finite alphabet. For instance, a string may represent the history of visited locations of one or more individuals, with each letter corresponding to a location. Similarly, it may represent the history of search query terms of one or more web users, with letters corresponding to query terms, or a medically important part of the DNA sequence of a patient, with letters corresponding to DNA bases. Analyzing such strings is key in applications including location-based service provision, product recommendation, and DNA sequence analysis. Therefore, such strings are often disseminated beyond the party that has collected them. For example, location-based service providers often outsource their data to data analytics companies who perform tasks such as similarity evaluation between strings [15], and retailers outsource their data to marketing agencies who perform tasks such as mining frequent patterns from the strings [16].
However, disseminating a string intact may result in the exposure of confidential knowledge, such as trips to mental health clinics in transportation data [23], query terms revealing political beliefs or sexual orientation of individuals in web data [19], or diseases associated with certain parts of DNA data [17]. Thus, it may be necessary to sanitize a string prior to its dissemination, so that confidential knowledge is not exposed. At the same time, it is important to preserve the utility of the sanitized string, so that data protection does not outweigh the benefits of disseminating the string to the party that analyzes it, or to the society at large. For example, a retailer should still be able to obtain actionable knowledge in the form of frequent patterns from the marketing agency that analyzed their outsourced data; and researchers should still be able to perform analyses such as identifying significant patterns in DNA sequences.
Our Model and Setting. Motivated by the discussion above, we introduce the following model, which we call Combinatorial String Dissemination (CSD). In CSD, a party has a string W that it seeks to disseminate, while satisfying a set of constraints and a set of desirable properties. For instance, the constraints aim to capture privacy requirements and the properties aim to capture data utility considerations (e.g., posed by some other party based on applications). To satisfy both, W must be transformed into a string X by applying a sequence of edit operations. The computational task is to determine this sequence of edit operations so that X satisfies the desirable properties subject to the constraints.
Under the CSD model, we consider a specific setting in which the sanitized string X must satisfy the following constraint C1: for a given integer k > 0, no length-k substring (also called pattern) modeling confidential knowledge should occur in X. We call each such length-k substring a sensitive pattern. We aim at finding the shortest possible string X satisfying the following desired properties: (P1) the order of appearance of all other length-k substrings (nonsensitive patterns) is the same in W and in X; and (P2) the frequency of these length-k substrings is the same in W and in X. The problem of constructing X in this setting is referred to as TFS (Total order, Frequency, Sanitization). Clearly, substrings of arbitrary lengths can be hidden from X by setting k equal to the length of the shortest substring we wish to hide, and then setting, for each of these substrings, any length-k substring of it as sensitive.
Our setting is motivated by real-world applications involving string dissemination. In these applications, a data custodian disseminates the sanitized version X of a string W to a data recipient, for the purpose of analysis (e.g., mining). W contains confidential information that the data custodian needs to hide, so that it does not occur in X. Such information is specified by the data custodian based on domain expertise, as in [1, 4, 12, 16]. At the same time, the data recipient specifies the properties P1 and P2 that X must satisfy in order to be useful. These properties map directly to common data utility considerations in string analysis. By satisfying P1, X allows tasks based on the sequential nature of the string, such as blockwise q-gram distance computation [13], to be performed accurately. By satisfying P2, X allows computing the frequency of length-k substrings [21] and hence mining frequent length-k substrings with no utility loss. We require that X has minimal length so that it does not contain redundant information. For instance, the string that is constructed by concatenating all nonsensitive length-k substrings of W and separating them with a special letter that does not occur in W satisfies P1 and P2 but is not the shortest possible. Such a string will have a negative impact on the efficiency of any subsequent analysis tasks to be performed on it.
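Properties P1 and P2 can be checked mechanically: the sequence of nonsensitive length-k substrings of W, read left to right, must coincide with the sequence of length-k substrings of the sanitized string that are over the original alphabet. The helper below is a hypothetical sketch (not from the paper's implementation), assuming sensitive occurrences are given by their starting positions and that, as in the constructions introduced later, a separator letter '#' may appear in the sanitized string:

```python
def order_and_freq_preserved(W, X, k, sensitive):
    """Check utility properties P1 (order) and P2 (frequency) at once:
    the left-to-right sequence of nonsensitive length-k substrings of W
    must equal the sequence of length-k substrings of X that are free
    of the separator letter '#'.  `sensitive` is a set of starting
    positions of sensitive occurrences in W (a simplifying assumption)."""
    seq_W = [W[i:i + k] for i in range(len(W) - k + 1) if i not in sensitive]
    seq_X = [X[i:i + k] for i in range(len(X) - k + 1) if '#' not in X[i:i + k]]
    return seq_W == seq_X
```

Comparing the two sequences element-wise verifies the order (P1) and, since every occurrence is listed, the frequencies (P2) simultaneously.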
Note that existing works for sequential data sanitization (e.g., [4, 12, 14, 16, 25]) or anonymization (e.g., [3, 5, 7]) cannot be applied to our setting (see Section 7).
Our Contributions. We define the TFS problem for string sanitization and a variant of it, referred to as PFS (Partial order, Frequency, Sanitization), which aims at producing an even shorter string Y by relaxing P1 of TFS. Our algorithms for TFS and PFS construct the strings X and Y using a separator letter #, which is not contained in the alphabet of W. This prevents occurrences of sensitive patterns in X or Y. The algorithms repeat proper substrings of sensitive patterns so that the frequency of nonsensitive patterns overlapping with sensitive ones does not change. For X, we give a deterministic construction which may be easily reversible (i.e., it may enable a data recipient to construct W from X), because the occurrences of # reveal the exact location of sensitive patterns. For Y, we give a construction which breaks several ties arbitrarily, thus being less easily reversible. We further address the reversibility issue by defining the MCSR (Minimum-Cost Separators Replacement) problem and designing an algorithm for dealing with it. In MCSR, we seek to replace all separators, so that the location of sensitive patterns is not revealed, while preserving data utility. We make the following specific contributions:
1. We design an algorithm for solving the TFS problem in O(kn) time, where n is the length of W. In fact, we prove that O(kn) time is worst-case optimal by showing that the length of X is Θ(kn) in the worst case. The output of the algorithm is a string X consisting of a sequence of substrings over the alphabet of W separated by # (see Example 1 below). An important feature of our algorithm, which is useful in the efficient construction of Y discussed next, is that it can be implemented to produce an O(n)-sized representation of X with respect to W in O(n) time. See Section 3.
Example 1
Let k = 4, and consider a string W with a given set of sensitive patterns. The string X consists of three substrings over the alphabet of W separated by #. Note that no sensitive pattern occurs in X, while all nonsensitive substrings of length k have the same frequency in X and in W (e.g., aaba appears twice) and appear in the same order in X and in W (e.g., babb precedes abbb). Also, note that any string shorter than X would either create sensitive patterns or change the frequencies (e.g., removing the last letter of X creates a string in which baab no longer appears).∎
2. We define the PFS problem, relaxing P1 of TFS to produce shorter strings that are more efficient to analyze. Instead of a total order (P1), we require a partial order (Π1) that preserves the order of appearance only for sequences of successive nonsensitive length-k substrings that overlap by k−1 letters. This makes sense because the order of two successive nonsensitive length-k substrings with no length-(k−1) overlap has anyway been "interrupted" (by a sensitive pattern). We exploit this observation to shorten the string further. Specifically, we design an algorithm that solves PFS in the optimal O(n + |Y|) time, where |Y| is the length of the output string Y, using the O(n)-sized representation of X. See Section 4.
Example 2
(Cont'd from Example 1) Recall that k = 4. A string Y is aaababbba#aabaab. The order of babb and abbb is preserved in Y since they are successive, nonsensitive, and with an overlap of k−1 = 3 letters. The order of abaa and aaab, which are successive and nonsensitive, is not preserved since they do not have an overlap of k−1 = 3 letters.∎
3. We define the MCSR problem, which seeks to produce a string Z by deleting or replacing all separators in Y with letters from the alphabet of W so that: no sensitive patterns are reinstated in Z; occurrences of spurious patterns that may not be mined from W but can be mined from Z, for a given support threshold, are prevented; and the distortion incurred by the replacements in Z is bounded. The first requirement is to preserve privacy and the next two to preserve data utility. We show that MCSR is NP-hard and propose a heuristic to attack it. See Section 5.
4. We implemented our combinatorial approach for sanitizing a string (i.e., all aforementioned algorithms implementing the pipeline W → X → Y → Z) and show its effectiveness and efficiency on real and synthetic data. See Section 6.
2 Preliminaries, Problem Statements, and Main Results
Preliminaries. Let W = W[0]W[1] ⋯ W[n−1] be a string of length |W| = n over a finite ordered alphabet Σ of size |Σ| = σ. By Σ* we denote the set of all strings over Σ. By Σ^k we denote the set of all length-k strings over Σ. For two positions i and j on W, we denote by W[i..j] = W[i] ⋯ W[j] the substring of W that starts at position i and ends at position j of W. By ε we denote the empty string of length 0. A prefix of W is a substring of the form W[0..j], and a suffix of W is a substring of the form W[i..n−1]. A proper prefix (suffix) of a string is not equal to the string itself. By Freq_V(U) we denote the number of occurrences of string U in string V. Given two strings U and V, we say that U has a suffix-prefix overlap of length ℓ > 0 with V if and only if the length-ℓ suffix of U is equal to the length-ℓ prefix of V, i.e., U[|U|−ℓ..|U|−1] = V[0..ℓ−1].
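The suffix-prefix overlap notion can be illustrated with a short, hypothetical helper (a naive scan, not part of the paper's algorithms; linear-time alternatives exist via KMP failure links):

```python
def has_suffix_prefix_overlap(U, V, ell):
    # U has a suffix-prefix overlap of length ell with V iff the
    # length-ell suffix of U equals the length-ell prefix of V.
    return 0 < ell <= min(len(U), len(V)) and U[len(U) - ell:] == V[:ell]

def longest_suffix_prefix_overlap(U, V):
    # Naive quadratic scan from the longest candidate downwards.
    for ell in range(min(len(U), len(V)), 0, -1):
        if U[len(U) - ell:] == V[:ell]:
            return ell
    return 0
```

For instance, babb has a suffix-prefix overlap of length 3 with abbb, the overlap used by the partial order of Section 4.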
We fix a string W of length n over an alphabet Σ and an integer 0 < k < n. We refer to a length-k string and a pattern interchangeably. An occurrence of a pattern is uniquely represented by its starting position. Let S be a set of positions over {0, …, n−k} with the following closure property: for every i ∈ S, if there exists j such that W[j..j+k−1] = W[i..i+k−1], then j ∈ S. That is, if an occurrence of a pattern is in S, then all its occurrences are in S. A substring W[i..i+k−1] of W is called sensitive if and only if i ∈ S. S is thus the set of occurrences of sensitive patterns. The difference set I = {0, …, n−k} \ S is the set of occurrences of nonsensitive patterns.
For any string U, we denote by I_U the set of occurrences of nonsensitive length-k strings over Σ in U. (We have that I_W = I.) We call an occurrence i the t-predecessor of another occurrence j in I_U if and only if i is the largest element in I_U that is less than j. This relation induces a strict total order on the occurrences in I_U. We call i the p-predecessor of j in I_U if and only if i is the t-predecessor of j in I_U and U[i..i+k−1] has a suffix-prefix overlap of length k−1 with U[j..j+k−1]. This relation induces a strict partial order on the occurrences in I_U. We call a subset of I_U a t-chain (resp., p-chain) if, for all of its elements except the minimum one, their t-predecessor (resp., p-predecessor) is also in the subset. For two strings U and V, chains C_U and C_V are equivalent, denoted by C_U ≡ C_V, if and only if |C_U| = |C_V| and U[u_i..u_i+k−1] = V[v_i..v_i+k−1], where u_i is the i-th smallest element of C_U and v_i is the i-th smallest element of C_V, for all i.
Problem Statements and Main Results.
Problem 1 (TFS)
Given W, k, S, and I, construct the shortest string X such that:
C1: X does not contain any sensitive pattern.
P1: I_W ≡ I_X, i.e., the t-chains I_W and I_X are equivalent.
P2: Freq_X(U) = Freq_W(U), for all nonsensitive patterns U ∈ Σ^k.
TFS requires constructing the shortest string X in which all sensitive patterns are concealed (C1), while preserving the order (P1) and the frequency (P2) of all nonsensitive patterns. Our first result is the following.
Theorem 2.1
Let W be a string of length n over an alphabet Σ. Given k < n and S, TFS-ALGO solves Problem 1 in O(kn) time, which is worst-case optimal. An O(n)-sized representation of X can be built in O(n) time.
P1 implies P2, but P1 is a strong requirement that may result in long output strings that are inefficient to analyze. We thus relax P1 to require that the order of appearance remains the same only for sequences of successive nonsensitive length-k substrings that also overlap by k−1 letters (p-chains).
Problem 2 (PFS)
Given W, k, S, and I, construct a shortest string Y such that:
C1: Y does not contain any sensitive pattern.
Π1: For any p-chain C_W of I_W, there is a p-chain C_Y of I_Y such that C_W ≡ C_Y.
P2: Freq_Y(U) = Freq_W(U), for all nonsensitive patterns U ∈ Σ^k.
Our second result, which builds on Theorem 2.1, is the following.
Theorem 2.2
Let W be a string of length n over an alphabet Σ. Given k < n and S, PFS-ALGO solves Problem 2 in the optimal O(n + |Y|) time.
To arrive at Theorems 2.1 and 2.2, we use a special letter (separator) # ∉ Σ when required. However, the occurrences of # may reveal the locations of sensitive patterns. We thus seek to delete or replace the occurrences of # in Y with letters from Σ. The new string Z should not reinstate any sensitive pattern. Given an integer threshold τ > 0, we call a pattern U ∈ Σ^k a τ-ghost in Z if and only if Freq_W(U) < τ but Freq_Z(U) ≥ τ. Moreover, we seek to prevent τ-ghost occurrences in Z by also bounding the total weight of the letter choices we make to replace the occurrences of #. This is the MCSR problem. We show that already a restricted version of the MCSR problem, namely the version with pattern length k = 1, is NP-hard via the Multiple Choice Knapsack (MCK) problem [20].
Theorem 2.3
The MCSR problem is NPhard.
Based on this connection, we propose a nontrivial heuristic algorithm to attack the MCSR problem for the general case of an arbitrary pattern length k.
3 TFS-ALGO
We convert string W into a string X over the alphabet Σ ∪ {#}, # ∉ Σ, by reading the letters of W, from left to right, and appending them to X while enforcing the following two rules:
R1: When the last letter of a sensitive substring U is read from W, we append # to X (essentially replacing this last letter of U with #). Then we append right after # the longest proper prefix of the succeeding nonsensitive substring (in the t-predecessor order), so that the occurrence of this substring is completed when its last letter is later read from W.
R2: When the k−1 letters before # are the same as the k−1 letters after #, we remove # and the k−1 succeeding letters (inspect Fig. 1).
R1 prevents sensitive patterns from occurring in X, and R2 reduces the length of X (i.e., allows us to protect sensitive patterns with fewer extra letters). Both rules leave unchanged the order and the frequencies of nonsensitive patterns. It is crucial to observe that applying the idea behind R2 on more than k−1 letters would decrease the frequency of some pattern, while applying it on fewer than k−1 letters would create new patterns. Thus, we need to consider R2 exactly as stated.
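The net effect of R1 and R2 can be sketched as follows. The code is a hypothetical, simplified quadratic-time illustration, not the actual TFS-ALGO: it spells each maximal run of consecutive nonsensitive length-k occurrences, joins runs with #, and merges two adjacent runs whenever R2 applies. It assumes k ≥ 2 and that sensitive occurrences are given as a set of starting positions:

```python
def tfs_sketch(W, k, sensitive):
    """Illustrative sketch of rules R1/R2 (not the linear-time TFS-ALGO).
    Spell each maximal run of consecutive nonsensitive length-k
    occurrences, then join the runs with '#', merging two runs when
    they have a suffix-prefix overlap of length k-1 (rule R2).
    Assumes k >= 2; `sensitive` holds starting positions."""
    nonsens = [i for i in range(len(W) - k + 1) if i not in sensitive]
    if not nonsens:
        return ''
    blocks, prev = [], None
    for i in nonsens:
        if prev is not None and i == prev + 1:
            blocks[-1] += W[i + k - 1]   # extend the current run by one letter
        else:
            blocks.append(W[i:i + k])    # new run; R1 puts '#' before it
        prev = i
    merged = [blocks[0]]
    for b in blocks[1:]:
        if merged[-1][-(k - 1):] == b[:k - 1]:
            merged[-1] += b[k - 1:]      # R2: drop '#' and the k-1 repeated letters
        else:
            merged.append(b)
    return '#'.join(merged)
```

For example, hiding the occurrence of bcb and cbc in abcbca (k = 3) yields abca: the two surviving runs abc and bca overlap by k−1 = 2 letters, so R2 removes the separator, while the order and frequencies of the nonsensitive patterns are preserved.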
Let C be an array of size n that stores the occurrences of sensitive and nonsensitive patterns: C[i] = 1 if i ∈ S and C[i] = 0 if i ∈ I. For technical reasons, we set the last k−1 values of C equal to C[n−k], i.e., C[n−k+1] = … = C[n−1] = C[n−k]. Note that C is constructible from S in O(n) time. Given C and k < n, TFS-ALGO efficiently constructs X by implementing R1 and R2 concurrently, as opposed to implementing R1 and then R2 (see the proof of Lemma 1 for the details of the workings of TFS-ALGO and Fig. 1 for an example). We next show that string X enjoys several properties.
Lemma 1
Let W be a string of length n over Σ. Given k < n and array C, TFS-ALGO constructs the shortest string X such that the following hold:
(C1) There exists no W[i..i+k−1] with i ∈ S occurring in X.
(P1) I_W ≡ I_X, i.e., the order of the substrings W[i..i+k−1], for all i ∈ I, is the same in W and in X; conversely, the order of all length-k substrings over Σ of X is the same in X and in W.
(P2) Freq_X(U) = Freq_W(U), for all nonsensitive patterns U ∈ Σ^k.
(P3) The occurrences of letter # in X are bounded in number, and any two of them are more than k positions apart.
(P4) The length of X is O(kn), and this bound is tight.
Proof
Proofs of C1 and P1–P4 can be found in the appendix. We prove here that X has minimal length. Let X_j be the prefix of string X obtained by processing the first j letters of string W. We proceed by induction on j, claiming that X_j is the shortest string such that C1 and P1–P4 hold for the processed prefix of W. We call such a string optimal.
Base case: by the initialization of TFS-ALGO, the prefix of X produced so far is equal to the first nonsensitive length-k substring of W, and it is clearly the shortest string such that C1 and P1–P4 hold for the processed prefix of W.
Inductive hypothesis and step: X_{j−1} is optimal for j−1. If the letter read at step j completes no nonsensitive occurrence, then X_j = X_{j−1}, and this is clearly optimal. If it extends the current run of nonsensitive occurrences, then X_j = X_{j−1}W[j−1], thus still optimal. Finally, if it completes the first nonsensitive occurrence after a sensitive one, we have two subcases: if R2 applies, then X_j = X_{j−1}W[j−1], and once again X_j is evidently optimal. Otherwise, X_j = X_{j−1}#W[j−k+1..j−1]. Suppose by contradiction that there exists a shorter X′_j such that C1 and P1–P4 still hold: it must either drop # or append fewer than k−1 letters after #. If we appended fewer than k−1 letters after #, since TFS-ALGO will not read those letters of W ever again, P2–P3 would be violated, as an occurrence of a nonsensitive pattern would be missed. Without #, the last letters of X′_j would violate either C1 or P1 and P2 (since we supposed that R2 does not apply). Then X_j is optimal.∎
Proof (of Theorem 2.1)
For the first part, inspect TFS-ALGO. The initialization can be realized in O(n) time. The main while loop is executed no more than n times, and every operation inside the loop takes O(1) time, except for the lines implementing R1, which take O(k) time. This gives O(kn) time in total. Correctness and optimality follow directly from Lemma 1 (P4).
For the second part, we assume that X is represented by W and a sequence of pointers to W, interleaved (if necessary) by occurrences of #. When R1 applies, we can use an interval of positions of W to represent the length-(k−1) substring of W added to X. In all other cases we can use a single pointer, as one letter is added to X per one letter of W. By Lemma 1 we can have only a bounded number of occurrences of letter #. The check of rule R2 can be implemented in constant time after linear-time preprocessing of W for longest common extension queries [9]. All other operations take in total linear time in n. Thus there exists an O(n)-sized representation of X and it is constructible in O(n) time.∎
4 PFS-ALGO
Lemma 1 tells us that X is the shortest string satisfying constraint C1 and properties P1–P4. If we were to drop P1 and employ the partial-order property Π1 (see Problem 2), the length of X would not always be minimal: if a permutation of the substrings of X delimited by # contains successive strings with a suffix-prefix overlap of length k−1, we may further apply R2, obtaining a shorter string while still satisfying Π1.
To find such a permutation efficiently and construct a shorter string Y from W, we propose PFS-ALGO. The crux of our algorithm is an efficient method to solve a variant of the classic NP-complete Shortest Common Superstring (SCS) problem [11]. Specifically, our algorithm: (I) Computes the O(n)-sized representation of string X using Theorem 2.1. (II) Constructs a collection C of strings, each of two symbols (two identifiers): the first (resp., second) symbol of the i-th element of C is a unique identifier of the string corresponding to the length-(k−1) prefix (resp., suffix) of the i-th substring of X delimited by #. (III) Computes a shortest string containing every element of C as a distinct substring. (IV) Constructs Y by mapping back each element to its distinct substring of X. If there are multiple possible shortest strings, one is selected arbitrarily.
Example 3 (Illustration of the workings of PFS-ALGO)
Consider a string X with two occurrences of #. The #-delimited substrings of X yield the collection C of two-symbol strings (id of prefix, id of suffix). A shortest string containing all elements of C as distinct substrings is obtained by permuting the #-delimited substrings of X and then applying R2 twice; it is mapped back to the solution Y. For example, the element 13 is mapped back to aaabac. Note that X contains two occurrences of # while Y contains none, so Y is shorter than X.∎
We now present the details of PFS-ALGO. We first introduce the Fixed-Overlap Shortest String with Multiplicities (FO-SSM) problem: Given a collection B of strings and an integer ℓ, with |b| > ℓ for all b ∈ B, FO-SSM seeks to find a shortest string containing each element of B as a distinct substring using the following operations on any pair of strings b1, b2 ∈ B:
1. concat(b1, b2) = b1 b2;
2. merge(b1, b2) = b1 b2[ℓ..|b2|−1], applicable when b1 has a suffix-prefix overlap of length ℓ with b2.
Any solution to FO-SSM, with B equal to the collection of substrings of X delimited by # and with ℓ = k−1, implies a solution to the PFS problem, because |b| > k−1 for all b ∈ B (see Lemma 1, P3).
The FO-SSM problem is a variant of the SCS problem. In the SCS problem, we are given a set of strings and we are asked to compute the shortest common superstring of the elements of this set. The SCS problem is known to be NP-complete, even for binary strings [11]. However, if all strings are of length two, the SCS problem admits a linear-time solution [11]. We exploit this crucial detail to obtain a linear-time solution to the FO-SSM problem in Lemma 3. To arrive at this result, we first adapt the linear-time SCS solution of [11] to our needs (see Lemma 2) and then plug it into Lemma 3.
Lemma 2
Let C be a collection of strings, each of length two, over an alphabet Σ. We can compute a shortest string containing every element of C as a distinct substring in O(|C|) time.
Proof
We sort the elements of C lexicographically in O(|C|) time using radix sort. We also replace every letter in these strings with its lexicographic rank in O(|C|) time using radix sort. In O(|C|) time we construct the de Bruijn multigraph G of these strings [6]. Within the same time complexity, we find all nodes v in G with in-degree, denoted by IN(v), smaller than out-degree, denoted by OUT(v). We perform the following two steps:
Step 1: While there exists a node v in G with IN(v) < OUT(v), we start an arbitrary path (with possibly repeated nodes) from v, traversing consecutive edges and deleting them. Each time we delete an edge, we update the in- and out-degree of the affected nodes. We stop traversing edges when a node v′ with OUT(v′) = 0 is reached; whenever IN(v′) = OUT(v′) = 0, we also delete v′ from G. Then, we add the traversed path to a set P of paths. A path can contain the same node more than once. If G is empty we halt. Proceeding this way, there are no two elements p1 and p2 in P such that p1 ends with the node with which p2 starts; thus this path decomposition is minimal. If G is not empty at the end of this step, then, by construction, it consists of only cycles.
Step 2: While G is not empty, we perform the following. If there exists a cycle c in G that intersects any path p in P, we splice c into p, update p with the result of the splicing, and delete c from G. This operation can be efficiently implemented by maintaining an array A of linked lists over the paths in P: the list for letter α stores pointers to all occurrences of α in the elements of P. Thus, in constant time per node of c, we check whether any such path exists in P and, in this case, splice the two. If no such path exists in P, we add to P any path-linearization of the cycle, and delete the cycle from G. After each change to P, we update A and delete every node u with IN(u) = OUT(u) = 0 from G.
The correctness of this algorithm follows from the fact that P is a minimal path decomposition of G. Thus any concatenation of the paths in P represents a shortest string containing all elements of C as distinct substrings.∎
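The two steps above can be sketched as follows. This is a simplified, hypothetical rendering: cycle splicing scans the paths directly instead of maintaining the array of linked lists, so it is not the linear-time version of the proof.

```python
from collections import defaultdict, deque

def shortest_containing_pairs(pairs):
    """Shortest string containing each length-2 string in `pairs` as a
    distinct substring occurrence, via a minimal path decomposition of
    the de Bruijn multigraph (simplified sketch, not linear-time)."""
    out_edges = defaultdict(deque)
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for a, b in pairs:                      # each pair is one edge a -> b
        out_edges[a].append(b)
        outdeg[a] += 1
        indeg[b] += 1
    nodes = set(indeg) | set(outdeg)

    def walk(start):                        # traverse and delete edges greedily
        path, u = [start], start
        while out_edges[u]:
            w = out_edges[u].popleft()
            outdeg[u] -= 1
            indeg[w] -= 1
            path.append(w)
            u = w
        return path

    paths = []
    # Step 1: open a path at every node whose remaining out-degree
    # exceeds its remaining in-degree; what is left consists of cycles.
    changed = True
    while changed:
        changed = False
        for v in nodes:
            if outdeg[v] > indeg[v]:
                paths.append(walk(v))
                changed = True
    # Step 2: splice each remaining cycle into a path sharing a node,
    # or keep a linearization of the cycle as a path of its own.
    for v in nodes:
        while out_edges[v]:
            cycle = walk(v)                 # starts and ends at v
            for path in paths:
                shared = next((u for u in cycle[:-1] if u in path), None)
                if shared is not None:
                    j = cycle.index(shared)  # rotate cycle to start at shared
                    rotated = cycle[j:-1] + cycle[:j] + [shared]
                    i = path.index(shared)
                    path[i:i + 1] = rotated  # splice into the path
                    break
            else:
                paths.append(cycle)
    return ''.join(''.join(p) for p in paths)
```

Every edge contributes one letter and every path one extra letter, so the output length equals the number of edges plus the number of paths, which the minimal path decomposition minimizes.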
Lemma 3
Let B be a collection of strings over an alphabet Σ. Given an integer ℓ, with ℓ < |b| for all b ∈ B, the FO-SSM problem for B can be solved in time linear in the total length of the strings in B.
Thus, PFS-ALGO applies Lemma 3 on the collection of substrings of X delimited by #, with ℓ = k−1 (recall that each such substring has length greater than k−1). Note that each time the concat operation is performed, it also places the letter # in between the two strings.
Lemma 4
Let W be a string of length n over an alphabet Σ. Given k < n and array C, PFS-ALGO constructs a shortest string Y satisfying C1, Π1, and P2–P4.
Proof (of Theorem 2.2)
We compute the O(n)-sized representation of string X with respect to W described in the proof of Theorem 2.1. This can be done in O(n) time. If X contains no occurrence of #, then we construct and return Y = X in O(n) time from the representation. Otherwise, we compute the LCP data structure of string W in O(n) time [9] and implement Lemma 3 in O(n) time by avoiding to read string X explicitly: we rather rename the elements of the collection to two-letter strings by employing the LCP information of W directly. We then construct and report Y in O(n + |Y|) time. Correctness follows directly from Lemma 4.∎
5 The MCSR Problem and MCSR-ALGO
The strings X and Y, constructed by TFS-ALGO and PFS-ALGO, respectively, may contain the separator #, which reveals information about the location of the sensitive patterns in W. Specifically, a malicious data recipient can go to the position of a # in X and "undo" Rule R1 that has been applied by TFS-ALGO, removing # and the k−1 letters after # from X. The result will be an occurrence of the sensitive pattern. For example, applying this process to the first # in the string X shown in Fig. 1 results in recovering the sensitive pattern abab. A similar attack is possible on the string Y produced by PFS-ALGO, although it is hampered by the fact that substrings within two consecutive #s in X often swap places in Y.
To address this issue, we seek to construct a new string Z, in which the #s are either deleted or replaced by letters from Σ. To preserve privacy, we require that separator replacements do not reinstate sensitive patterns. To preserve data utility, we favor separator replacements that have a small cost in terms of occurrences of τ-ghosts (patterns with frequency less than τ in W and at least τ in Z) and that incur a bounded level of distortion in Z, as defined below. This is the MCSR problem, a restricted version of which is presented in Problem 3. The restricted version is referred to as MCSR_{k=1} and differs from MCSR in that it uses k = 1 for the pattern length instead of an arbitrary value. MCSR_{k=1} is presented next for simplicity and because it is used in the proof of Lemma 5 (see the appendix for the proof). Lemma 5 implies Theorem 2.3.
Problem 3 (MCSR_{k=1})
Given a string Y over an alphabet Σ ∪ {#}, with δ > 0 occurrences of letter #, and parameters τ and θ, construct a new string Z by substituting the δ occurrences of # in Y with letters from Σ, such that:
(I) the total cost of the τ-ghosts created in Z (see function Ghost below) is minimum, and (II) the total weight of the substituted letters (see function Sub below) is at most θ.
The cost of τ-ghosts is captured by a function Ghost. This function assigns a cost to an occurrence of a τ-ghost caused by a separator replacement at a given position, and it is specified based on domain knowledge. For example, by assigning a fixed cost to each gained occurrence of each τ-ghost, we penalize more heavily a ghost with frequency much below τ in W, and the penalty increases with the number of gained occurrences. Moreover, we may want to penalize positions towards the end of a temporally ordered string, to avoid spurious patterns that would be deemed important in applications based on time-decaying models [8].
The replacement distortion is captured by a function Sub, which assigns a weight to a letter that could replace a # and is specified based on domain knowledge. The maximum allowable replacement distortion is θ. Small weights favor the replacement of separators with desirable letters (e.g., letters that reinstate nonsensitive frequent patterns), while letters that reinstate sensitive patterns are assigned a weight larger than θ, which prohibits them from replacing a #. Similarly, weights larger than θ are assigned to letters that would lead to implausible patterns [14] if they replaced #s.
Lemma 5
The MCSR_{k=1} problem is NP-hard.
Theorem 2.3 follows directly from Lemma 5, since MCSR_{k=1} is a special case of MCSR.
MCSR-ALGO. Our MCSR-ALGO is a nontrivial heuristic that exploits the connection of the MCSR and MCK [20] problems and works by:
(I) Constructing the set of all candidate τ-ghost patterns (i.e., length-k strings over Σ with frequency below τ in W that can attain frequency at least τ in Z).
(II) Creating an instance of MCK from an instance of MCSR. For this, we map the i-th occurrence of # to a class C_i in MCK and each possible replacement of this occurrence with a letter to a different item of C_i. Specifically, we consider all possible replacements with letters from Σ and also a replacement with the empty string, which models deleting (instead of replacing) the i-th occurrence of #.
In addition, we set the costs and weights that are input to MCK as follows. The cost of replacing the i-th occurrence of # with a letter α is set to the sum of the Ghost function over all candidate τ-ghost patterns affected when the i-th occurrence of # is replaced by α. That is, we make the worst-case assumption that the replacement forces all candidate τ-ghosts to become ghosts in Z. The weight of replacing the i-th occurrence of # with letter α is set to Sub(α).
(III) Solving the instance of MCK and translating the solution back to a (possibly suboptimal) solution of the MCSR problem. For this, we replace the i-th occurrence of # with the letter corresponding to the item chosen by the MCK algorithm from class C_i, and similarly for every other occurrence of #. If the instance has no solution (i.e., no possible replacement can hide the sensitive patterns), MCSR-ALGO reports that Z cannot be constructed and terminates.
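Step (III) requires an MCK solver. The sketch below is a minimal exact dynamic program over the weight budget (a hypothetical stand-in, assuming nonnegative integer weights; it is not the algorithm of [20]):

```python
def solve_mck(classes, theta):
    """Multiple Choice Knapsack: pick exactly one (cost, weight) item
    per class, minimizing total cost subject to total weight <= theta.
    Returns (min_cost, chosen item indices) or None if infeasible.
    Assumes nonnegative integer weights."""
    dp = {0: (0, [])}                    # total weight -> (cost, choices)
    for items in classes:
        ndp = {}
        for w, (c, picks) in dp.items():
            for idx, (cost, weight) in enumerate(items):
                nw = w + weight
                if nw > theta:
                    continue             # would exceed the distortion budget
                cand = (c + cost, picks + [idx])
                if nw not in ndp or cand[0] < ndp[nw][0]:
                    ndp[nw] = cand       # keep the cheapest choice per weight
        dp = ndp
        if not dp:
            return None                  # no feasible item for this class
    return min(dp.values(), key=lambda t: t[0])
```

In the MCSR-ALGO mapping, each class would hold one item per letter of Σ plus one for deletion, with costs from Ghost and weights from Sub; an infeasible instance corresponds to the case where Z cannot be constructed.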
Lemma 6 below states the running time of MCSR-ALGO (see the appendix for the proof and for an efficient implementation of this algorithm).
Lemma 6
MCSR-ALGO runs in O(n + T) time, where T is the running time of the MCK algorithm for δ classes with σ + 1 items each.
6 Experimental Evaluation
We evaluate our approach, referred to as TPM, in terms of data utility and efficiency. Given a string W over Σ, TPM sanitizes W by applying TFS-ALGO, PFS-ALGO, and then MCSR-ALGO, which uses the algorithm of [20] for solving the MCK instances. The final output is a string Z over Σ.
Experimental Setup and Data. We do not compare TPM against existing methods, because they are not alternatives to our approach (see Section 7). Instead, we compare TPM against a greedy baseline referred to as BA.
BA initializes its output string Z to W and then considers each occurrence of a sensitive pattern in Z, from left to right. For each such occurrence, it replaces the letter of the pattern that has the largest frequency in W with another letter α that is not contained in the pattern and has the smallest frequency in W, breaking all ties arbitrarily. If no such α exists, the letter is replaced by # to ensure that a solution is produced (even if it may reveal the location of a sensitive pattern). Each replacement removes the occurrence of the sensitive pattern and aims to prevent τ-ghost occurrences by selecting an α that will not substantially increase the frequency of patterns overlapping with the sensitive pattern.
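BA can be sketched as follows (hypothetical code; where the description leaves the choice open, letter frequencies are taken in the original string W and ties are broken by position and by lexicographic rank):

```python
from collections import Counter

def ba_sanitize(W, k, sensitive_patterns):
    """Greedy baseline sketch: scan left to right; in each occurrence of
    a sensitive pattern, replace its most W-frequent letter by the least
    W-frequent letter not occurring in the pattern ('#' if none exists).
    Ties: leftmost position, then (frequency, letter) order."""
    Z, freq, alphabet = list(W), Counter(W), set(W)
    for i in range(len(W) - k + 1):
        pat = ''.join(Z[i:i + k])
        if pat in sensitive_patterns:
            j = max(range(i, i + k), key=lambda p: freq[Z[p]])
            candidates = sorted(alphabet - set(pat), key=lambda c: (freq[c], c))
            Z[j] = candidates[0] if candidates else '#'
    return ''.join(Z)
```

Note that, unlike MCSR-ALGO, this baseline does not re-examine whether a replacement creates a new sensitive occurrence elsewhere, which is one reason it may incur lost and ghost patterns.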
We considered the following publicly available datasets used in [1, 12, 14, 16]: Oldenburg (OLD), Trucks (TRU), MSNBC (MSN), the complete genome of Escherichia coli (DNA), and synthetic data (uniformly random strings, the largest of which is referred to as SYN). See Table 1 for the characteristics of these datasets and the parameter values used in experiments, unless stated otherwise.
Dataset | Data domain    | Length n   | Alphabet size σ | # sensitive patterns | # sensitive positions | Pattern length k
OLD     | Movement       | 85,563     | 100             | … | … | …
TRU     | Transportation | 5,763      | 100             | … | … | …
MSN     | Web            | 4,698,764  | 17              | … | … | …
DNA     | Genomic        | 4,641,652  | 4               | … | … | …
SYN     | Synthetic      | 20,000,000 | 10              | … | … | …
The sensitive patterns were selected randomly among the frequent length-k substrings at minimum support τ, following [12, 14, 16]. We used fairly low values of τ for TRU, OLD, MSN, and DNA, to have a wider selection of sensitive patterns. We also used a uniform cost for every occurrence of each τ-ghost, a small uniform weight for each letter replacement that does not create a sensitive pattern, a weight exceeding θ for each letter replacement that does, and a corresponding value of θ. This setup treats all candidate τ-ghost patterns and all candidate letters for replacement uniformly, to facilitate a fair comparison with BA, which cannot distinguish between ghost candidates or favor specific letters.
To capture the utility of sanitized data, we used a (frequency) distortion measure that aggregates, over all nonsensitive patterns U, the difference between Freq_W(U) and Freq_Z(U). The distortion measure quantifies changes in the frequency of nonsensitive patterns, with low values suggesting that Z remains useful for tasks based on pattern frequency (e.g., identifying motifs corresponding to functional or conserved DNA [21]). We also measured the number of τ-ghost and τ-lost patterns in Z, following [12, 14, 16], where a pattern U is τ-lost in Z if and only if Freq_W(U) ≥ τ but Freq_Z(U) < τ. That is, lost patterns model knowledge that can no longer be mined from Z but could be mined from W, whereas ghost patterns model knowledge that can be mined from Z but not from W. A small number of lost/ghost patterns suggests that frequent pattern mining can be accurately performed on Z [12, 14, 16]. Unlike BA, by design TPM does not incur any lost pattern, as TFS-ALGO and PFS-ALGO preserve the frequencies of nonsensitive patterns, and MCSR-ALGO can only increase pattern frequencies.
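The τ-ghost and τ-lost counts follow directly from the definitions; a small, hypothetical helper (substrings containing the separator '#' are ignored when counting frequencies):

```python
from collections import Counter

def kmer_freqs(s, k):
    # Frequencies of length-k substrings over the original alphabet;
    # substrings containing the separator '#' are ignored.
    return Counter(s[i:i + k] for i in range(len(s) - k + 1)
                   if '#' not in s[i:i + k])

def ghost_and_lost(W, Z, k, tau):
    """tau-ghost: frequency < tau in W but >= tau in Z;
       tau-lost:  frequency >= tau in W but < tau in Z."""
    fw, fz = kmer_freqs(W, k), kmer_freqs(Z, k)
    patterns = set(fw) | set(fz)
    ghosts = {p for p in patterns if fw[p] < tau <= fz[p]}
    lost = {p for p in patterns if fz[p] < tau <= fw[p]}
    return ghosts, lost
```

A frequency-based distortion measure can be computed from the same two Counter objects by aggregating, per nonsensitive pattern, the difference of its two frequencies.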
All experiments ran on an Intel Xeon E5-2640 at 2.66GHz with 16GB RAM. Our source code, written in C++, is available at https://bitbucket.org/stringsanitization. The results have been averaged over repeated runs.
Data Utility. We first demonstrate that TPM incurs very low distortion, which implies high utility for tasks based on the frequency of patterns (e.g., [21]). Fig. 2 shows that, for a varying number of sensitive patterns, TPM incurred substantially lower distortion than BA over all experiments. Also, Fig. 2 shows that TPM remains effective even in challenging settings with many sensitive patterns (e.g., the last point in Fig. 2(b), where a large fraction of the positions in W are sensitive). Fig. 3 shows that, for varying k, TPM again caused substantially lower distortion than BA over all experiments.
Next, we demonstrate that TPM permits accurate frequent pattern mining: Fig. 4 shows that TPM led to no lost or ghost patterns for the TRU and MSN datasets. This implies no utility loss for mining frequent length substrings with threshold . In all other cases, the number of ghosts was on average (and up to ) times smaller than the total number of lost and ghost patterns for BA. BA performed poorly (e.g., up to of frequent patterns became lost for TRU and for DNA). Fig. 5 shows that, for varying , TPM led to on average (and up to ) times fewer lost/ghost patterns than BA. BA performed poorly (e.g., up to of frequent patterns became lost for DNA).
We also demonstrate that PFSALGO substantially reduces the length of the output string of TFSALGO, creating a string that contains less redundant information and allows for more efficient analysis. Fig. 5(a) shows the length of and of and their difference for . was much shorter than , and its length decreased with the number of sensitive patterns, since more substrings had a suffix-prefix overlap of length and were removed (see Section 4). Interestingly, the length of was close to that of (the string before sanitization). A larger led to a less substantial length reduction, as shown in Fig. 5(b) (but still a few thousand letters were removed), since it is less likely for long substrings of sensitive patterns to have an overlap and be removed.
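The length reduction comes from collapsing suffix-prefix overlaps. A minimal illustration of the underlying idea (our own toy code, not PFSALGO itself) merges consecutive length-k strings whenever they share a (k-1)-letter suffix-prefix overlap, so each overlapping string contributes only one new letter:

```python
# Toy illustration of suffix-prefix overlap merging: when consecutive
# length-k strings overlap by k-1 letters, only the last letter of the
# next string needs to be appended, shortening the output.
# Assumes k >= 2 and a non-empty list of strings.

def merge_overlapping(strings, k):
    """Concatenate strings, collapsing (k-1)-letter suffix-prefix overlaps."""
    out = strings[0]
    for s in strings[1:]:
        if out[-(k - 1):] == s[:k - 1]:   # overlap of length k-1
            out += s[k - 1:]              # append only the new letter
        else:
            out += s                      # no overlap: append whole string
    return out

merged = merge_overlapping(["abc", "bcd", "cde"], 3)  # overlaps "bc", "cd"
```

Here three length-3 strings collapse into the 5-letter string "abcde" instead of a 9-letter concatenation, mirroring the reduction measured in Fig. 5(a).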
Efficiency. We finally measured the runtime of TPM using prefixes of the synthetic string SYN whose length is million letters. Fig. 5(c) (resp., Fig. 5(d)) shows that TPM scaled linearly with (resp., ), as predicted by our analysis in Section 5 (TPM takes time, since the algorithm of [20] was used for MCK instances). In addition, TPM is efficient, with a runtime similar to that of BA and less than seconds for SYN.
7 Related Work
Data sanitization (a.k.a. knowledge hiding) aims at concealing patterns modeling confidential knowledge by limiting their frequency, so that they are not easily mined from the data. Existing methods are applied to: (I) a collection of set-valued data (transactions) [24] or spatiotemporal data (trajectories) [1]; (II) a collection of sequences [12, 14]; or (III) a single sequence [4, 16, 25]. Yet, none of these methods follows our CSD setting: methods in category I are not applicable to string data, and those in categories II and III offer no guarantees on privacy-related constraints [25] or on utility-related properties [12, 14, 4, 16]. Specifically, unlike our approach, [25] cannot guarantee that all sensitive patterns are concealed (constraint C1), while [12, 14, 4, 16] do not guarantee the satisfaction of utility properties (e.g., and P2).
8 Conclusion
In this paper, we introduced the Combinatorial String Dissemination model. The focus of this model is on guaranteeing privacy-utility trade-offs (e.g., C1 vs. and P2). We defined a problem (TFS) which seeks to produce the shortest string that preserves the order of appearance and the frequency of all non-sensitive patterns, and a variant (PFS) that preserves a partial order and the frequency of the non-sensitive patterns but produces a shorter string. We developed two time-optimal algorithms, TFSALGO and PFSALGO, for the problem and its variant, respectively. We also developed MCSRALGO, a heuristic that prevents the disclosure of the location of sensitive patterns from the outputs of TFSALGO and PFSALGO. Our experiments show that sanitizing a string by TFSALGO, PFSALGO, and then MCSRALGO is effective and efficient.
Acknowledgments.
HC is supported by a CSC scholarship. GR and NP are partially supported by the MIUR-SIR project CMACBioSeq, grant n. RBSI146R5L. We acknowledge the use of the Rosalind HPC cluster hosted by King’s College London.
References
 [1] Abul, O., Bonchi, F., Giannotti, F.: Hiding sequential and spatiotemporal patterns. TKDE 22(12), 1709–1723 (2010)
 [2] Aggarwal, C.C., Yu, P.S.: On anonymization of string data. In: SDM. pp. 419–424 (2007)
 [3] Aggarwal, C.C., Yu, P.S.: A framework for condensation-based anonymization of string data. DMKD 16(3), 251–275 (2008)
 [4] Bonomi, L., Fan, L., Jin, H.: An information-theoretic approach to individual sequential data sanitization. In: WSDM. pp. 337–346 (2016)
 [5] Bonomi, L., Xiong, L.: A two-phase algorithm for mining sequential patterns with differential privacy. In: CIKM. pp. 269–278 (2013)
 [6] Cazaux, B., Lecroq, T., Rivals, E.: Linking indexing data structures to de Bruijn graphs: Construction and update. J. Comput. Syst. Sci. (2016)
 [7] Chen, R., Acs, G., Castelluccia, C.: Differentially private sequential data publication via variable-length n-grams. In: CCS. pp. 638–649 (2012)
 [8] Cormode, G., Korn, F., Tirthapura, S.: Exponentially decayed aggregates on data streams. In: ICDE. pp. 1379–1381 (2008)
 [9] Crochemore, M., Hancart, C., Lecroq, T.: Algorithms on strings. Cambridge University Press (2007)
 [10] Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: TCC. pp. 265–284 (2006)
 [11] Gallant, J., Maier, D., Storer, J.A.: On finding minimal length superstrings. J. Comput. Syst. Sci. 20(1), 50–58 (1980)
 [12] Gkoulalas-Divanis, A., Loukides, G.: Revisiting sequential pattern hiding to enhance utility. In: KDD. pp. 1316–1324 (2011)
 [13] Grossi, R., Iliopoulos, C.S., Mercas, R., Pisanti, N., Pissis, S.P., Retha, A., Vayani, F.: Circular sequence comparison: algorithms and applications. AMB 11, 12 (2016)
 [14] Gwadera, R., Gkoulalas-Divanis, A., Loukides, G.: Permutation-based sequential pattern hiding. In: ICDM. pp. 241–250 (2013)
 [15] Liu, A., Zheng, K., Li, L., Liu, G., Zhao, L., Zhou, X.: Efficient secure similarity computation on encrypted trajectory data. In: ICDE. pp. 66–77 (2015)
 [16] Loukides, G., Gwadera, R.: Optimal event sequence sanitization. In: SDM. pp. 775–783 (2015)
 [17] Malin, B., Sweeney, L.: Determining the identifiability of DNA database entries. In: AMIA. pp. 537–541 (2000)
 [18] Monreale, A., Pedreschi, D., Pensa, R.G., Pinelli, F.: Anonymity preserving sequential pattern mining. Artif. Intell. Law 22(2), 141–173 (2014)
 [19] Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets. In: S&P. pp. 111–125 (2008)
 [20] Pisinger, D.: A minimal algorithm for the multiple-choice knapsack problem. Eur J Oper Res 83(2), 394–410 (1995)
 [21] Pissis, S.P.: MoTeX-II: structured MoTif eXtraction from large-scale datasets. BMC Bioinformatics 15, 235 (2014)
 [22] Sinha, P., Zoltners, A.A.: The multiple-choice knapsack problem. Operations Research 27(3), 431–627 (1979)
 [23] Theodorakopoulos, G., Shokri, R., Troncoso, C., Hubaux, J., Boudec, J.L.: Prolonging the hide-and-seek game: Optimal trajectory privacy for location-based services. In: WPES. pp. 73–82 (2014)
 [24] Verykios, V.S., Elmagarmid, A.K., Bertino, E., Saygin, Y., Dasseni, E.: Association rule hiding. TKDE 16(4), 434–447 (2004)
 [25] Wang, D., He, Y., Rundensteiner, E., Naughton, J.F.: Utility-maximizing event stream suppression. In: SIGMOD. pp. 589–600 (2013)
Appendix 0.A Omitted Proofs
Proof (Lemma 1)
Index in TFSALGO runs over the positions of string ; at any moment it indicates the ending position of the currently considered length substring of . When (Lines 11), TFSALGO never appends , i.e., the last letter of a sensitive length substring, implying that, by construction of , no with occurs in . When (Lines 11), TFSALGO appends to , thus the order of and is clearly preserved. When and , index stores the starting position on of the length suffix of the last non-sensitive substring appended to (see also Fig. 1). C1 ensures that no sensitive substring is added to in this case, nor when . The next letter will thus be appended to when and (Lines 11). The condition on Line 1 is satisfied if and only if the last non-sensitive length substring appended to overlaps with the immediately succeeding non-sensitive one by letters: in this case, the last letter of the latter is appended to by Line 1, clearly maintaining the order of the two. Otherwise, Line 1 will append to , once again maintaining the length substrings’ order. Conversely, by construction, any occurs in only if it equals a length non-sensitive substring of . The only occasion when a letter from is appended to more than once is when Line 1 is executed: it is easy to see that in this case, because of the occurrence of , each of the repeated letters creates exactly one , without introducing any new length string over nor increasing the occurrences of a previous one. Finally, Line 1 does not introduce any new except for the one present in , nor any extra occurrence of the latter, because it is only executed when two consecutive non-sensitive length substrings of overlap exactly by letters.
It follows from the proof for C1 and P1.
Letter is added only by Line 1, which is executed only when and . This can be the case up to times as array can have alternate values only in the first positions. By construction, cannot start with (Lines 11), and thus the maximal number of occurrences of is . By construction, letter in is followed by at least letters (Line 1): the leftmost nonsensitive substring following a sequence of one or more occurrences of sensitive substrings in .
Upper bound. TFSALGO increases the length of string by more than one letter only when letter is added to (Line 1). Every time Lines 11 are executed, the length of increases by letters. Thus the length of is maximized when the maximal number of occurrences of is attained. This length is thus bounded by .
Tightness. For the lower bound, let and be sensitive. The condition at Line 1 is not satisfied because no element in is set to 0: . Then the condition on Line 1 is also not satisfied because , and thus TFSALGO outputs the empty string. A de Bruijn sequence of order over an alphabet is a string in which every possible length string over occurs exactly once as a substring. For the upper bound, let be the order de Bruijn sequence over alphabet , be even, and . and so Line 1 will add the first letters of to . Then observe that , and so on; this sequence of values corresponds to satisfying Lines 1 and 1 alternately. Line 1 does not add any letter to . The if statement on Line 1 will always fail because of the de Bruijn sequence property. We thus have a sequence of the nonsensitive length substrings of interleaved by occurrences of appended to . TFSALGO thus outputs a string of length (see Example 4).
Example 4 (Illustration of P3)
Let . We construct the order de Bruijn sequence of length over alphabet , and choose . TFSALGO constructs:
The upper bound of on the length of is attained.∎
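The de Bruijn property underpinning this tightness argument is easy to verify computationally. Below is the textbook Lyndon-word construction of a de Bruijn sequence (a standard algorithm, not taken from the paper), together with a check that every length-n string over the alphabet occurs exactly once as a cyclic substring:

```python
# Textbook Lyndon-word construction of a de Bruijn sequence, plus a
# check of its defining property: every length-n string over the
# alphabet occurs exactly once as a cyclic substring.

def de_bruijn(alphabet, n):
    k = len(alphabet)
    a = [0] * (k * n)
    seq = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                seq.extend(a[1:p + 1])   # emit a Lyndon word whose length divides n
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in seq)

B = de_bruijn("ab", 3)            # one of the order-3 binary sequences, length 2**3
L = B + B[:2]                     # linearize the cycle (append first n-1 letters)
windows = {L[i:i + 3] for i in range(len(B))}
# windows is the set of all 8 distinct length-3 strings over {a, b}
```

Since no two length-n windows coincide, no suffix-prefix overlap of the required length exists between distinct windows, which is exactly what the tightness construction exploits.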
Proof (Lemma 3)
Consider the following renaming technique. Each length substring of the collection is assigned a lexicographic rank from the range . Each string in is converted to a two-letter string as follows. The first letter is the lexicographic rank of its length prefix and the second letter is the lexicographic rank of its length suffix. We thus obtain a new collection of two-letter strings. Computing the ranks for all length substrings in can be implemented in time by employing radix sort to sort and then the well-known LCP data structure over the concatenation of strings in [9]. The FOSSM problem is thus solved by finding a shortest string containing every element of as a distinct substring. Since consists of two-letter strings only, we can solve the problem in time by applying Lemma 2. The statement follows.∎
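Our reading of this renaming step can be sketched as follows (helper names are ours, and plain sorting stands in for the linear-time radix-sort/LCP machinery of the proof): each string is mapped to the pair of lexicographic ranks of the prefix and suffix obtained by dropping its last and first letter, respectively:

```python
# Sketch of the two-letter renaming: every string of length l in the
# collection becomes a pair (rank of its (l-1)-prefix, rank of its
# (l-1)-suffix). Plain sorting is used here instead of radix sort + LCP.
# Assumes a non-empty collection of equal-length strings.

def rename(collection):
    l = len(collection[0])
    halves = sorted({s[:l - 1] for s in collection} |
                    {s[1:] for s in collection})
    rank = {h: r for r, h in enumerate(halves)}   # lexicographic ranks
    return [(rank[s[:l - 1]], rank[s[1:]]) for s in collection]

pairs = rename(["abc", "bca", "cab"])
```

Two renamed strings can be glued exactly when the second letter of one equals the first letter of the next, so a shortest superstring over the two-letter collection corresponds to one over the original collection.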
Proof (Lemma 4)
C1 and P2 hold trivially for , as no length substring over is added or removed from . Let . The order of non-sensitive length substrings within , for all , is preserved in . Thus for any p-chain of , there is a p-chain of such that ( is preserved). P3 also holds trivially for , as no occurrence of is added. Since , for P4, it suffices to note that the construction of in the proof of tightness in Lemma 1 (see also Example 4) ensures that there is no suffix-prefix overlap of length between any pair of length substrings of over , due to the property of the order de Bruijn sequence. Thus the upper bound of on the length of is also tight for .
The minimality on the length of follows from the minimality of and the correctness of Lemma 3 that computes a shortest such string. ∎
Proof (Lemma 5)
We reduce the NP-hard Multiple Choice Knapsack (MCK) problem [22] to in polynomial time. In MCK, we are given a set of elements subdivided into , mutually exclusive classes, , and a knapsack. Each class has elements. Each element has an arbitrary cost and an arbitrary weight . The goal is to minimize the total cost (Eq. 1) by filling the knapsack with one element from each class (constraint II), such that the weights of the elements in the knapsack satisfy constraint I, where constant represents the minimum allowable total weight of the elements in the knapsack:
(1) \min \sum_{i=1}^{m} \sum_{j \in C_i} c_{ij} x_{ij}
subject to the constraints: (I) \sum_{i=1}^{m} \sum_{j \in C_i} w_{ij} x_{ij} \ge b; (II) \sum_{j \in C_i} x_{ij} = 1, for each i \in \{1, \ldots, m\}; and (III) x_{ij} \in \{0, 1\}, for each i and each j \in C_i. Here c_{ij} and w_{ij} denote the cost and weight of element j of class C_i, and b is the minimum allowable total weight.
The variable x_{ij} takes value 1 if element j is chosen from class C_i, and 0 otherwise (constraint III). We reduce any instance to an instance in polynomial time, as follows:

Alphabet consists of letters , for each and each class , .

We set . Every element of occurs exactly once: . Letter occurs times in . For convenience, let us denote by the th occurrence of in .

We set and .

and . The functions are otherwise not defined.
This is clearly a polynomial-time reduction. We now prove the correspondence between a solution to the given instance and a solution to the instance .
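As background, the MCK variant used here (choose exactly one element per class, minimize total cost subject to a minimum total weight) admits a standard pseudopolynomial dynamic program. The sketch below is a textbook formulation with integer weights assumed and names of our own choosing, not part of the reduction itself:

```python
# Textbook pseudopolynomial DP for MCK as stated above: pick exactly one
# (cost, weight) element from each class so that total weight is at
# least b, minimizing total cost. Integer weights assumed.

def mck_min_cost(classes, b):
    """classes: list of classes, each a list of (cost, weight) pairs."""
    INF = float("inf")
    dp = {0: 0}                           # capped total weight -> min cost
    for cls in classes:
        ndp = {}
        for w, c in dp.items():
            for cost, weight in cls:      # exactly one element per class
                nw = min(b, w + weight)   # weights beyond b are equivalent
                nc = c + cost
                if nc < ndp.get(nw, INF):
                    ndp[nw] = nc
        dp = ndp
    return dp.get(b, INF)                 # INF if weight b is unreachable

# cheapest feasible pick: (cost 1, weight 1) from class 1 and
# (cost 2, weight 2) from class 2, total weight 3 >= b = 3
best = mck_min_cost([[(3, 2), (1, 1)], [(2, 2), (5, 3)]], 3)
```

Capping accumulated weight at b keeps the state space of size O(b) per class, giving O(b · total number of elements) time overall.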
We first show that if is a solution to , then is a solution to . Since the elements in have minimum ,