String Sanitization: A Combinatorial Approach

06/26/2019
by Giulia Bernardini, et al.

String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge (e.g., trips to mental health clinics in a string representing a user's location history). In this paper, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns while maintaining data utility. First, we propose a time-optimal algorithm, TFS-ALGO, to construct the shortest string that preserves the order of appearance and the frequency of all non-sensitive patterns. Such a string allows tasks that depend on the sequential nature and pattern frequencies of the original string to be performed accurately. Second, we propose a time-optimal algorithm, PFS-ALGO, which preserves only a partial order of appearance of non-sensitive patterns but produces a much shorter string that can be analyzed more efficiently. The strings produced by either of these algorithms may reveal the location of sensitive patterns. In response, we propose a heuristic, MCSR-ALGO, which replaces letters in these strings with carefully selected letters, so that sensitive patterns are not reinstated and occurrences of spurious patterns are prevented. We implemented our sanitization approach, which applies TFS-ALGO, PFS-ALGO, and then MCSR-ALGO, and show experimentally that it is effective and efficient.
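To make the problem statement concrete, here is a minimal Python sketch of a naive sanitizer. It is not TFS-ALGO, PFS-ALGO, or MCSR-ALGO, and it makes no attempt to produce the shortest string. It rests on assumptions the abstract does not state: that patterns are substrings of a fixed length `k`, that the sensitive patterns are given as a set, and that a separator letter `#` outside the alphabet may be inserted. The names `kmers` and `naive_sanitize` are hypothetical.

```python
def kmers(s, k, sep="#"):
    """Length-k substrings of s, in order, skipping any that contain sep."""
    return [s[i:i + k] for i in range(len(s) - k + 1) if sep not in s[i:i + k]]


def naive_sanitize(w, k, sensitive, sep="#"):
    """Keep maximal runs of consecutive non-sensitive length-k patterns
    and join the runs with sep (assumed not to occur in w).

    Illustrative baseline only: it conceals every sensitive pattern and
    keeps the order and frequency of the non-sensitive ones, but the
    result is generally longer than necessary.
    """
    blocks = []
    start = None  # index where the current run of non-sensitive patterns begins
    for i in range(len(w) - k + 1):
        if w[i:i + k] in sensitive:
            if start is not None:
                # The run's last pattern starts at i-1 and ends at i+k-2.
                blocks.append(w[start:i + k - 1])
                start = None
        elif start is None:
            start = i
    if start is not None:
        blocks.append(w[start:])
    return sep.join(blocks)


w, k, sensitive = "aabaaab", 3, {"baa"}
x = naive_sanitize(w, k, sensitive)
print(x)  # aaba#aaab
# The non-sensitive patterns appear in the same order and with the same counts.
print(kmers(x, k) == [p for p in kmers(w, k) if p not in sensitive])  # True
```

The visible `#` also shows why a later replacement step is needed: the separator marks where a sensitive pattern used to be.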


