Privacy-preserving Data Splitting: A Combinatorial Approach

01/18/2018 · by Oriol Farràs, et al. · Universitat Rovira i Virgili

Privacy-preserving data splitting is a technique that aims to protect data privacy by storing different fragments of data in different locations. In this work we give a new combinatorial formulation to the data splitting problem. We see the data splitting problem as a purely combinatorial problem, in which we have to split data attributes into different fragments in a way that satisfies certain combinatorial properties derived from processing and privacy constraints. Using this formulation, we develop new combinatorial and algebraic techniques to obtain solutions to the data splitting problem. We present an algebraic method which builds an optimal data splitting solution by using Gröbner bases. Since this method is not efficient in general, we also develop a greedy algorithm for finding solutions that are not necessarily minimal sized.


1 Introduction

Data collected by companies and organizations is increasingly large, and it is nowadays unfeasible for some data owners to locally store and process it because of the associated costs (such as hardware, energy and maintenance costs). The cloud offers a suitable alternative for data storage, by providing large and highly scalable storage and computational resources at a low cost and with ubiquitous access. However, many data owners are reluctant to embrace cloud computing because of security and privacy concerns, mainly regarding the cloud service provider (CSP). The problem is not only that CSPs may read, use or even sell the data outsourced by their customers; but also that they may suffer attacks or data leakages that can compromise data confidentiality.

Privacy-preserving data splitting is a technique that aims to protect data privacy in this setting. Data splitting minimizes the leakage of information by distributing the data among several CSPs, assuming that they do not communicate with each other. Similar problems have been studied in other areas such as data mining, data sanitization, file splitting and data merging.

In general, in data splitting, data sets are structured in a tabular format, according to a set of attributes (or features) identifiable by attribute names, as the table schema. Data is then composed of records, where each record holds up to one value per attribute. For instance, we can consider the attributes ‘Name’, ‘Age’, ‘Occupation’, and a record {‘John’, ‘21’, ‘Student’}, where the record holds values for all attributes.

Data splitting comes in three flavours: horizontal, vertical and mixed. In this work, we deal with vertical data splitting, where fragments consist of data on all records, but only contain information on a subset of the attributes. In horizontal data splitting, fragments contain part of the records, and information on all attributes is specified. In mixed data splitting, fragments hold partial information on some records.
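
To make the three flavours concrete, the vertical and horizontal cases can be sketched in a few lines of Python (the table, attribute names and helper functions below are our own illustration, not part of the paper):

```python
# A toy illustration of vertical vs. horizontal splitting; the attribute
# names and records below are invented for this sketch.
records = [
    {"Name": "John", "Age": 21, "Occupation": "Student"},
    {"Name": "Mary", "Age": 34, "Occupation": "Nurse"},
]

def vertical_split(records, attribute_fragments):
    """Each fragment keeps every record, but only a subset of the attributes."""
    return [[{a: r[a] for a in attrs} for r in records]
            for attrs in attribute_fragments]

def horizontal_split(records, k):
    """Each fragment keeps a subset of the records, with all attributes."""
    return [records[i::k] for i in range(k)]

v = vertical_split(records, [["Name"], ["Age", "Occupation"]])
h = horizontal_split(records, 2)
```

A mixed split would combine both ideas, keeping partial attribute information on partial sets of records.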

Horizontal data splitting is not privacy-preserving by itself, because all the information of an individual record is stored together; hence, it does not preserve privacy by decomposition [2]. Horizontal data splitting has been used to analyze data collected by different entities on a set of patients [24], or in conjunction with homomorphic encryption [29], to mine horizontally-partitioned data without violating privacy.

Vertical data splitting can be used for privacy-preserving purposes [2, 23]. In particular, in a setting where some combinations of attributes constitute the sensitive information, the data set can be vertically split and distributed among CSPs so that no CSP holds any sensitive attribute combination. Assuming that CSPs do not communicate with each other, this measure guarantees privacy. An example of a sensitive pair of attributes in a medical data setting is passport number and disease, whereas blood pressure and disease constitute a generally safe pair.

The results we present in this work focus on data splitting, but they can be applied to other related areas such as file splitting, data sanitization, and data merging.

In file splitting, pieces of files owned by the same entity are stored in different sites. This is done in such a way that pieces from each site, when considered in isolation, are not sensitive. In [1], the authors spread the data across multiple CSPs and introduce redundancy in order to tolerate possible failures or outages. Their solution follows what has been done for years at the level of disks and file systems, particularly in the RAID (Redundant Array of Independent Disks) technology, which stripes data across an array of disks and maintains parity data that can be used to reconstruct the contents of any individual failed disk. In [17], users’ files are categorized and split into chunks, and these chunks are provided to the proper storage servers. The categorization of data is done according to mining sensitivity. To ensure a greater amount of privacy, clients are given the option of adding misleading data to chunks on demand. Wei et al. [39] proposed a new privacy method that involves bit splitting and bit combination. In their approach, the original files are broken up through bit splitting and each fragment is uploaded to a different storage server.
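
The RAID-style parity idea mentioned above can be sketched as follows (a minimal single-parity scheme with helper names of our own choosing): data is striped into chunks stored at different sites, and an XOR parity chunk allows any single lost chunk to be reconstructed.

```python
# Minimal sketch of RAID-style parity: data is striped into chunks, and a
# parity chunk (bitwise XOR of all chunks) lets any one lost chunk be
# reconstructed from the survivors.
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def stripe_with_parity(data, k):
    """Split `data` into k equally sized chunks plus one parity chunk."""
    assert len(data) % k == 0
    size = len(data) // k
    chunks = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, chunks)
    return chunks, parity

def reconstruct(chunks, parity, lost):
    """Recover chunk index `lost` by XOR-ing parity with the surviving chunks."""
    survivors = [c for i, c in enumerate(chunks) if i != lost]
    return reduce(xor_bytes, survivors, parity)

chunks, parity = stripe_with_parity(b"secret document!", 4)
```

Note that, unlike the privacy-oriented schemes discussed here, plain parity striping addresses availability rather than confidentiality.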

Data sanitization is the process of removing sensitive information from a document so as to ensure that only the intended information can be accessed. Typically, the result is a document that is suitable for dissemination to the intended audience. Data sanitization has been applied along with data splitting in [12], where the terms in the input document that cause disclosure risk according to the privacy requirements are first detected, and then those terms are distributed in multiple servers in order to prevent disclosure.

Data merging consists in securely splitting and merging data from potentially many sources in a single repository. An approach to data merging is to split and compress the data into multiple fragments, and to require certain privacy constraints on the fragments [5].

In data splitting, once the data is split, the main issue is how to securely compute over distributed data (see [40] for more details). For some computations the servers may need to exchange data, but none of them ought to reveal its own privately stored information. Computing over distributed data is also studied in the context of parallel processing for statistical computations. In this case, the problem is how to combine partial results obtained from independent processors. Related literature reduces statistical analysis to performing either secure distributed scalar products or secure distributed matrix products, e.g. see [18, 30, 34]. On a similar note, the field of privacy-preserving data mining deals with the problem of computing over distributed data. Its main objective is to mine data owned by different parties who are willing to collaborate in order to get better results, but who do not want or cannot share the raw original data. For instance, see [6, 13, 25].

1.1 Our results

In this work we give a new combinatorial formulation of the data splitting problem. In the considered data splitting problem, we force some subsets of attributes to be stored separately, because the combination of these attributes may reveal sensitive information to the CSPs. Moreover, we require some subsets of attributes to be stored together, because we want to query them efficiently or to compute statistics on them (data mining, selective correlations, etc.). Regarding privacy and security, the CSPs are not trusted and hence they are not given access to the entire original data set. We thus assume that the CSPs have access only to fragments of the original data set.

More specifically, we consider the honest-but-curious security model, where the CSPs honestly fill their role in the protocols and do not share information with each other, but they may try to infer information on the data available to them. In particular, each CSP may be curious to analyze the data it stores and the message flows received during the protocol in order to acquire additional information. Therefore, in our model the information leakage is the sensitive information that can be extracted from single stored data fragments. This model is common in the cloud computing literature, e.g. see [7].

In this setting, our main objective is to minimize the number of CSPs that are needed to store a data set using data splitting, without applying other privacy-preserving techniques. In order to study this problem, we set the data splitting constraints as two families of subsets of attributes: the family of subsets of attributes that have to be stored together, and the family of subsets of attributes that must not be jointly stored in any CSP. These two families respectively define processing and privacy constraints. We define a data splitting solution as a family of subsets of attributes which satisfies the processing and privacy constraints. Each set in this family must be outsourced to a single CSP. Therefore, we see the data splitting problem as a purely combinatorial problem, in which we have to split attributes into different fragments in a way that satisfies certain combinatorial properties derived from processing and privacy constraints.

Using this formulation, we develop new combinatorial and algebraic techniques to obtain solutions to the data splitting problem. We first present an algebraic method which builds a data splitting solution with the minimal number of fragments by using Gröbner bases. Since this method has performance issues, we also develop an efficient greedy algorithm for finding solutions that are not generally minimal sized. We compare the efficiency and the accuracy of the two approaches by giving experimental results. Using results of graph theory, we are able to provide necessary and sufficient conditions for the existence of a solution to the data splitting problem, and we give upper and lower bounds on the number of needed fragments.

1.2 Related work

Recently, data splitting research has focused on finding the minimal sized decomposition of a given data set into privacy-preserving fragments. Related works suggest outsourcing a sensitive data set by vertically splitting it according to some privacy constraints [2, 9, 10, 11, 23]. In all previously proposed methods, privacy constraints are described by sensitive pairs of attributes.

In [2], the authors study the problem of finding a decomposition of a given data set into two privacy-preserving vertical fragments, so as to store them in two CSPs which have to be completely unaware of each other. Query execution is also optimized, i.e. they minimize the execution cost of a given query workload while obeying the constraints imposed by the needs of data privacy. Graph-coloring techniques are used to identify a decomposition with small query costs. In particular, their data splitting problem can be reformulated as a hypergraph-coloring instance. In case some sensitive attribute pairs cannot be stored separately without increasing the number of fragments to more than two, encryption is used to ensure privacy. To improve the query workload, the storage of the same attribute in both CSPs is allowed.

The optimal decomposition problem described in [2] is hard to solve even if vertex deletion is not allowed. In fact, Guruswami et al. [26] proved that it is NP-hard to color a 2-colorable, 4-uniform hypergraph using only c colors, for any constant c. This means that, in the case that all 4-tuples of attributes are sensitive, it is NP-hard to find a partition of attributes into two sets that satisfies all privacy constraints, even knowing that such a partition exists. Because of the hardness of this problem, three different heuristics to solve it are presented in [2].

A later article [23] studies the same scenario as [2]. Here as well, they consider vertically splitting data into exactly two fragments, though their results are easily extendable to more fragments. They also allow encrypting sensitive attributes and storing the same attribute in both fragments. They introduce three heuristics to find a decomposition with small query costs. These heuristic search techniques are based on the greedy hill climbing approach, and give a nearly optimal solution.

In [23], the authors study the time complexity of the proposed optimal decomposition problem in terms of the number of attributes. The general problem can theoretically be solved in polynomial time if the collection contains only a few sets of constraints (by solving the minimum cut problem). It can also be solved in logarithmic time when the problem is equivalent to the hitting set problem. And it can be solved within a certain approximation factor when the constraint sets are small, by using directed multicut (i.e., solving the minimum edge deletion bipartition problem). The problem becomes intractable for larger sets of constraints. In fact, in this case the problem reduces to the not-all-equal satisfiability problem, which is NP-complete.

Also, [11] presents a solution for vertically splitting data into two fragments without requiring the use of encryption, but rather by using a trusted party (the data owner) to store a portion of the data and to perform part of the computation.

The solution presented in [10] uses both encryption and data splitting, but it allows the CSPs to communicate with each other. Because of this assumption, in order to ensure unlinkability between attributes, no attribute must appear in the clear in more than one fragment. In their solution, data is split into possibly more than two different fragments. This lowers the complexity of the problem with respect to [2] and [23]. The optimization problem is then to find a partition that minimizes the number of fragments and maximizes the number of attributes stored in the clear. Also in this case, the problem of finding a partition of the attribute set is NP-complete. Hence, they present two heuristic methods, whose time complexities depend on the number of privacy constraints and the number of attributes. The first one is based on the definition of vector minimality, and the second one works with an affinity matrix that expresses the advantage of having pairs of attributes in the same fragment.

A similar approach to [10] is also illustrated in [9], where they split a data set into an arbitrary number of non-linkable data fragments and distribute them among an arbitrary number of non-communicating servers.

The data splitting problem studied in this work is also related to other well-known combinatorial optimization problems. We want to emphasize the connection with the job shop scheduling problem, which consists in assigning jobs to resources at particular times. Welsh and Powell [38] described a basic scheduling problem as follows: consider a set of jobs, suppose that it takes an entire day to complete each job, and that resources are unbounded. An incompatibility matrix records, for each pair of jobs, whether or not they can be carried out on the same day. The problem consists in scheduling the jobs using the minimum needed number of days according to the restrictions imposed by the incompatibility matrix. An efficient algorithm to solve this problem is presented in [38], and subsequent works [31, 4] improve on this solution. See [8] for a survey on this and similar scheduling problems.

By interpreting jobs as attributes, days as data locations and the incompatibility matrix as a set of privacy constraints, we observe the equivalence between the problem posed in [38] and the data splitting problem. Through this same analogy, our setting extends to the following job scheduling problem: consider a set of jobs, suppose that it takes an entire day to complete each job, and that resources are unbounded. One family of sets of jobs specifies which jobs cannot all be carried out on the same day; similarly, a second family specifies sets of jobs that must all be carried out on the same day. The problem consists in scheduling the jobs using the minimum needed number of days according to the restrictions imposed by both families.
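
The basic pairwise setting can be sketched as a first-fit greedy scheduler (this is plain first-fit graph coloring, not the exact algorithm of [38]; the representation with a 0/1 incompatibility matrix is our own):

```python
# Greedy sketch of the basic scheduling problem: assign each job to the
# first day on which it is compatible with every job already scheduled
# there.  incompat[i][j] == 1 means jobs i and j cannot share a day.
def schedule(incompat):
    n = len(incompat)
    days = []  # days[d] is the set of jobs carried out on day d
    for job in range(n):
        for day in days:
            if all(incompat[job][other] == 0 for other in day):
                day.add(job)
                break
        else:
            days.append({job})  # open a new day for this job
    return days

# Jobs 0-1 and 1-2 are incompatible; jobs 0 and 2 may share a day.
incompat = [[0, 1, 0],
            [1, 0, 1],
            [0, 1, 0]]
days = schedule(incompat)
```

Under the analogy above, `days` plays the role of the fragments of a data splitting solution; greedy first-fit does not guarantee the minimum number of days.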

1.3 Outline of the work

Section 2 states the problem of privacy-preserving data splitting as a purely combinatorial problem which consists in splitting sensitive data into several fragments. This data splitting problem is stated as a covering problem. Section 3 presents an algebraic formulation of the covering problem stated in the previous section; Gröbner bases are used to find the optimal (i.e., minimal-sized) solutions. Section 4 proposes a linear-time method which solves the combinatorial problem, sacrificing solution optimality for efficiency; a heuristic improvement is also proposed. Section 5 presents the experimental results obtained by implementing the methods of Sections 3 and 4: first, the methods are compared on a real problem, and then a performance analysis of the linear-time methods is carried out over random graphs. Finally, Section 6 lists some conclusions.

2 A combinatorial approach

In this section we state the problem of privacy-preserving data splitting as a purely combinatorial problem. This problem consists in splitting a given data set in which some attributes are sensitive. As discussed above, this situation also covers problems of file splitting, data sanitization and data merging. First we introduce some notation.

Let be a set and let . For any , we define as the number of subsets in containing , and we define the degree of as the maximum of for every . For any , we also denote by the number of subsets such that . Note that for any we have . For a set , we define its closure . We define and as follows. A subset is in if and only if and there does not exist with . Analogously, a subset is in if and only if and there does not exist with . That is, is the family of minimal subsets in , and is the family of maximal subsets in . We say that is an antichain if for every . In this case, .

In the considered data splitting setting we have a set of attributes, and some combinations of the attributes are not to be stored by any individual server because they would leak sensitive information. We assume that individual attributes, when considered in isolation, are not sensitive (otherwise, encryption can be used). Moreover, we want some other attributes to be stored in the same location, for example to perform statistical analyses such as contingency tables, correlations or principal component analysis of the attributes. We thus describe a data splitting problem using two families of attribute subsets: the family of subsets of attributes that cannot be stored together in any single server, and the family of subsets of attributes that we want to be stored together in some server. We state the data splitting problem in terms of coverings, a notion first introduced in [22].

Definition 1.

Let . An -covering is a family of subsets of satisfying that

  1. for every and for every , , and

  2. for every there exists with .

Let be the families of subsets defined by the data splitting restrictions described above, and let be an -covering. Then defines a solution for data splitting by associating each fragment with a set . That is, we solve the data splitting problem by storing the data corresponding to attributes in at the -th location. Observe that, according to this definition, for each there is at least one fragment containing all attributes in , and none of the fragments contain all attributes in for any . These are exactly the restrictions we have for data splitting. Note that we distribute data in as many fragments as . Since is the family of subsets of attributes that we want to be stored together, we will always assume that .
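
Both conditions of Definition 1 can be checked mechanically. The following sketch (the names `together`, `privacy` and `fragments` are our own, chosen to match the informal description above) verifies a candidate family of fragments:

```python
# Check the two conditions of Definition 1 for a candidate covering:
#   `together`  - sets of attributes that must lie inside some fragment,
#   `privacy`   - sets of attributes that no fragment may fully contain,
#   `fragments` - the candidate family of attribute subsets.
def is_covering(fragments, together, privacy):
    fragments = [frozenset(f) for f in fragments]
    # Condition 1: no fragment contains a whole privacy set.
    if any(set(p) <= f for p in privacy for f in fragments):
        return False
    # Condition 2: every processing set is inside some fragment.
    return all(any(set(c) <= f for f in fragments) for c in together)

together = [{"age", "disease"}]
privacy = [{"passport", "disease"}]
ok = is_covering([{"age", "disease"}, {"passport"}], together, privacy)
```

The check runs in time polynomial in the input size, which is what makes the decisional covering problem verifiable (cf. Proposition 11).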

Our work is focused on minimizing the size of the coverings, which corresponds to the number of fragments in data splitting. Therefore, we say that is an optimal -covering if is minimal among all -coverings. Also, it could be desirable to minimize , which corresponds to the total amount of information that will be stored, and , which corresponds to the maximum redundancy in the storage.

Example 2.

Let be an antichain, and let . Then is a .

Next we present some technical results about coverings. The main results of this section are Proposition 5, which characterizes the existence of coverings, and Proposition 7, which justifies the search of -coverings in the case that and are antichains. In addition, we present a theoretical lower bound on the size of -coverings in Proposition 8.

Lemma 3.

Let . Then is an -covering if and only if

  1. for every , and

  2. for every .

Proof.

Let be an -covering. For every and for every , , and so for any we have . Hence . For every , there exists with , i.e. . Hence . This concludes the proof of one implication.

For any , if then for every . For any , if then there exists with . Hence the converse implication holds. ∎

As a direct consequence of this lemma, we have the following result.

Lemma 4.

Let with and . Every -covering is also a -covering.

The next proposition characterizes the pairs of subsets that admit -coverings, and was presented in [22].

Proposition 5.

Let . There exists an -covering if and only if

(1)
Proof.

Let be a -covering. By Lemma 3, for every and , and , so . Conversely, if for every and , then is an -covering. ∎

Lemma 6.

Let . If

  • for every there exists with , and

  • for every there exists with ,

then any -covering is also a -covering.

Proof.

Let be a -covering. Let and let with . For every we have , and so . Now let and let with . Then there exists a subset satisfying , which also satisfies . Hence is an -covering. ∎

Proposition 7.

Let . Then is an -covering if and only if it is a -covering.

Proof.

By Lemma 4, every -covering is a -covering. The converse implication is a direct consequence of Lemma 6. ∎

According to the previous proposition, we can always restrict the search of -coverings to the case where and are antichains. Further, as a consequence of Lemma 6 we can define a partial hierarchy among the pairs of antichains . For example, every -covering is also a -covering.

To conclude this section, we describe a theoretical lower bound on the size of -coverings. Note that, in the case and , the problem of finding an -covering is equivalent to the graph coloring problem on the graph . In this case, the size of an optimal -covering is just the chromatic number .

Existing general lower bounds on the chromatic number include the clique number, the minimum degree bound, Hoffman’s bound, the vector chromatic number, the Lovász number and the fractional chromatic number. Our proposed bound generalizes to the case of -coverings the minimum degree bound χ(G) ≥ n/(n−δ), where n is the number of vertices and δ is the minimum degree of G.

Proposition 8.

Let be families of subsets satisfying condition (1), and let be an -covering. Then

Proof.

Let be an -covering. Given , denote .

By the properties of -coverings we have that , and this implies that . Hence , and so . We now proceed to upper bound .

Since for every we have , we see that . Therefore, by the definition of -coverings we have that for every . Denote by the size of the largest subfamily of with this property, i.e.

By the preceding observation, we get that . By definition of , given any set we have , and so . Now, given a set , a family satisfies if and only if there exists an element such that . Therefore . Finally, by definition we see that . By composing the obtained results, we see that . The proposition follows by applying the first obtained inequality. ∎

2.1 Multi-colorings of hypergraphs

In order to construct -coverings, we will use colorings of hypergraphs. Let be a hypergraph. A coloring of with colors is a mapping such that for every there exists with .

Next we describe the connection between colorings and coverings. Let be a coloring of the hypergraph with colors. Consider the family of subsets of elements in of the same color according to . That is, consider a family of subsets that is a partition of satisfying that for every and for every .

Now consider the pair with . Observe that satisfies condition 1 in Definition 1, because if a subset is in , then it cannot be monochromatic. Since each element in has a color, condition 2 is also satisfied. In order to construct -coverings for other families of subsets , we can use sequences of colorings. In order to define these constructions appropriately, we consider multi-colorings of the hypergraph.

For any integer , we define a multi-coloring of of colors as a mapping with the following property: for every and for every , there exists for which the -th coordinate of is , namely . If we associate each with a different color, a multi-coloring of is a mapping that maps each element in to a set of at most colors. The mapping must satisfy that for every subset in and for each color, at least one element in does not have this color. A sequence of colorings of a hypergraph defines a multi-coloring. A multi-coloring defines in a natural way a family of subsets, and vice-versa. Given , we define , where is the subset of elements of mapped to the color .
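
The correspondence between multi-colorings and families of subsets can be sketched directly (the representation of a multi-coloring as a map from elements to sets of colors, and all variable names, are our own):

```python
# Correspondence between multi-colorings and families of fragments:
# color c induces the fragment of all elements carrying c, and a family
# of fragments assigns to each element the set of fragments containing it.
def coloring_to_fragments(coloring, num_colors):
    """coloring maps each element to a set of colors in range(num_colors)."""
    return [frozenset(x for x, cols in coloring.items() if c in cols)
            for c in range(num_colors)]

def fragments_to_coloring(fragments, elements):
    """Inverse direction: element -> indices of fragments containing it."""
    return {x: {i for i, f in enumerate(fragments) if x in f}
            for x in elements}

coloring = {"a": {0}, "b": {0, 1}, "c": {1}}
frags = coloring_to_fragments(coloring, 2)
back = fragments_to_coloring(frags, {"a", "b", "c"})
```

Going in one direction and back recovers the original multi-coloring, which is the sense in which the two objects define each other.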

Lemma 9.

Let , with . Then is an -covering if and only if defines a multi-coloring of of colors with the property that for every , there exists for which the -coordinate of is for every .

Proof.

Let be an -covering . We define a multi-coloring of colors as follows. For every and , if and only if is in . Let , and let be a subset in with . Then for every .

Taking into account the comments detailed above, it is straightforward to prove that the converse implication also holds. ∎

We use the connection between coverings and multi-colorings to find general constructions of coverings and upper bounds on their size. Beimel, Farràs, and Mintz constructed efficient secret sharing schemes for very dense graphs [3]. One of the techniques developed in that work is connected to our work. In [22], that result was described in terms of -coverings for . Due to Lemma 4, if , the biggest family of subsets admitting a -covering is . The next lemma states the results described above in a more general way.

Lemma 10.

Let be families of subsets satisfying condition (1). Let denote the degree of , and suppose that sets in and have size at most . Then there exists an -covering of degree and size .

2.2 Optimal covers

Both the optimization problem of determining the size of an optimal -covering and the search problem of finding an optimal -covering are NP-hard. This is so because making and transforms these problems into the corresponding graph coloring problems, and so there is a trivial reduction from the known NP-hard graph coloring problems to the -covering problems. Next, we show that the decisional problem is NP-complete.

Proposition 11.

The problem of deciding whether an -covering of size exists is NP-complete.

Proof.

Let define an instance of the problem where the answer is affirmative. Given an -covering of size , a checking algorithm first verifies that has size , that every is contained in some , and that no is contained in any . The running time of this checking algorithm is at most quadratic in the size of the problem input, and thus the given problem is in NP.

Now, note that the case and is equivalent to the graph coloring problem. Therefore the given problem is NP-complete. ∎

3 Algebraic formulation of the problem

In this section we present an algebraic formulation of the combinatorial problem presented in the previous section. The purpose of this formulation is to exploit algebraic techniques to find solutions to the data splitting problem for a fixed number of fragments.

It is not unusual for graph-coloring problems to be encoded as polynomial ideals [15, 16, 28, 32]. In this case, the existence of a coloring is reduced to the solvability of a related system of polynomial equations over the algebraic closure of the field. Furthermore, the weak Hilbert’s Nullstellensatz allows one to obtain a certificate that a system of polynomials has no solutions [14], and, consequently, that the graph is not colorable. The focus of this section is the use of polynomial ideals and Gröbner bases to provide an optimal multi-coloring with the property described in Lemma 9. Recall that obtaining such a multi-coloring is equivalent to finding an -covering.

Let be a hypergraph and be a multi-coloring of of colors with the property that for every , there exists for which the -coordinate of is for every . The multi-coloring can be seen as an assignment of values to a set of variables , where , , and if and only if . In other words, we assign variables to each vertex in in such a way that the coloring conditions translate into polynomial constraints.

Encoding to a polynomial ring allows an algebraic formulation of the multi-coloring problem. Since we focus on optimal multi-colorings, the number of colors is fixed to a designated minimal value. Furthermore, each variable takes values in {0,1}, which allows working over the field with two elements.

Therefore, given , we define the -coloring ideal to be the ideal generated by:


  • - all vertices belonging to an edge set cannot have the same color;


  • - there exists a color such that all the vertices in are colored with .

Theorem 12 proves that finding a solution of is equivalent to obtaining a multi-coloring .

Theorem 12.

Let a multi-coloring of , and assume . Then defines an -covering (in the sense of Lemma 9) if and only if has a common root in . In other words, the multi-coloring of does not define an -covering if and only if .

Proof.

is a multi-coloring map if it respects:

  • for every and for every , there exists for which is ;

  • for every , there exists for which the -coordinate of is for every .

It is known that if a polynomial encodes a property and encodes another property, then the ideal generated by and encodes the conjunction (i.e., and) of the properties. Therefore, if and encode the properties P1 and P2, respectively, then encodes .

  • : for all and for every color , we have that . This happens if and only if is iff there exists such that , which is equivalent to saying that there exists such that .

  • : for all , we have . This happens if and only if is iff there exists such that iff there exists such that for all , which is equivalent to saying that there exists such that for all .

Observe that imposing is not restrictive. In fact, it is always possible to add the singletons of any vertices to in order to guarantee that a color is assigned to every vertex, without changing the requirements of the problem (see Example 14 for more details). In particular, if , then for all there exists such that and therefore, there exists a color such that , which is equivalent to saying that there exists such that , which is equivalent to saying that there exists such that . This hypothesis has also been stated in Section 2.

Now that the data splitting problem is stated as an algebraic problem, a technique based on Gröbner bases can be used to solve it. A Gröbner basis is a generating set of an ideal in a polynomial ring which allows one to determine whether or not any given polynomial belongs to the ideal [14]. In other words, it allows one to determine the variety associated to the ideal, i.e. the solutions of the corresponding system. Every polynomial ideal admits a Gröbner basis [14]. Informally, Gröbner basis computation can be viewed as a generalization of Gaussian elimination to non-linear equations.

In our case, a Gröbner basis can be used to find the solutions of . Once the Gröbner basis of the -coloring ideal is obtained, the associated variety can be computed easily. The complexity of computing the Gröbner basis of a system of polynomial equations of degree in variables has been proven to be when the number of solutions is finite [21]. In general, its complexity is . Since belongs to , it has a finite number of solutions, and so the Gröbner basis complexity bound is , which represents the worst-case complexity. The Gröbner basis complexity is at least that of polynomial-system solving.

As stated before, the weak Hilbert's Nullstellensatz makes it possible to derive a certificate that a system of polynomials has no solution. In our case, this allows us to prove that it is not possible to find a multi-coloring with a designated number of colors .

Theorem 13 (weak Hilbert’s Nullstellensatz [14]).

Suppose that . Then there are no solutions to the system in the algebraic closure of if and only if there exist such that

(2)

The set is called a Nullstellensatz certificate. The complexity of computing a certificate depends on its degree, which is defined as the maximum degree of any . Efficient results have been achieved when the Nullstellensatz certificate has small constant degree [33].

According to Theorem 13, methods that compute a Nullstellensatz certificate make it possible to determine whether has solutions or not. A tentative number of colors is fixed, and the problem is then tackled by applying a Nullstellensatz certificate method. If a Nullstellensatz certificate is found, then the polynomials have no common root. The complexity of Nullstellensatz certificate and Gröbner basis methods grows with the number of variables which, in our case, grows with the number of colors. Therefore, it is convenient to start with few colors and increase their number until either no certificate of infeasibility can be found or a Gröbner basis is computed. Note that finding the optimal number of colors is an NP-complete problem, because it is equivalent in complexity to solving the system of equations.
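A minimal illustration (ours): for the clearly infeasible system {x+1, x−1}, the pair q1 = 1/2, q2 = −1/2 is a Nullstellensatz certificate, since q1·(x+1) + q2·(x−1) = 1; equivalently, the Gröbner basis of the generated ideal collapses to {1}:

```python
from sympy import symbols, Rational, expand, groebner

x = symbols('x')
f1, f2 = x + 1, x - 1                  # no common root
q1, q2 = Rational(1, 2), Rational(-1, 2)
# Certificate of infeasibility: q1*f1 + q2*f2 = 1.
print(expand(q1*f1 + q2*f2))           # 1
# Equivalently, the reduced Groebner basis of <f1, f2> is {1}.
print(list(groebner([f1, f2], x)))     # [1]
```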

Example 14.

Given and , we want to compute an -covering. As explained above, the problem can be encoded into polynomial ideals. We assign variables to each attribute, where is the number of colors needed to obtain an optimal covering. For example, vertex can be encoded by the variables and when colors are considered. The variable equals if and only if vertex takes color , and otherwise.

Therefore, is the ideal generated by the polynomials in and , where

  • .

  • .

Note that there does not exist an -covering of size one, because . By computing the Gröbner basis of , we find that -coverings of size two exist, and we obtain the optimal -coverings:

In the last solution, the variables and do not appear, which means that they can take both values, and . Therefore, the solutions can be rewritten as the following coverings:

To determine the number of colors that allows an optimal covering, a tentative value is fixed, starting with . If the Gröbner basis method applied to returns the ideal , then the next value of is considered, until the result is different from . The resulting is the smallest one for which an -covering exists, and thus it is optimal.
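This incremental search can be sketched in SymPy. The sketch below is ours and uses a standard Boolean encoding of proper graph coloring (a special case of the covering problem, as noted in Section 4) rather than the exact ideal of Example 14; a Gröbner basis equal to [1] certifies that no coloring with k colors exists, so k is increased:

```python
from sympy import symbols, groebner

def coloring_ideal(edges, n, k):
    """Boolean encoding: x[i][j] = 1 iff vertex i gets color j."""
    x = [[symbols(f'x_{i}_{j}') for j in range(k)] for i in range(n)]
    polys = []
    for i in range(n):
        polys += [v*(v - 1) for v in x[i]]            # variables are 0/1
        polys.append(sum(x[i]) - 1)                   # exactly one color
    for (u, v) in edges:
        polys += [x[u][j]*x[v][j] for j in range(k)]  # endpoints differ
    gens = [v for row in x for v in row]
    return polys, gens

# Single edge: one color is infeasible (basis [1]), two colors suffice.
edges, n = [(0, 1)], 2
for k in range(1, 4):
    polys, gens = coloring_ideal(edges, n, k)
    gb = groebner(polys, *gens, order='lex')
    if list(gb) != [1]:
        break
print(k)  # smallest feasible number of colors: 2
```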

4 A greedy algorithm

In this section we aim for an efficient method to build -coverings and for upper bounds on the size of an optimal -covering.

As seen above, the problem of finding an optimal -covering is NP-hard. Hence, as expected, the labour involved in finding an optimal -covering can render methods inefficient when solving practical data splitting instances. Our strategy to circumvent this consists in sacrificing optimality to achieve a polynomial-time algorithm.

The problem of finding upper bounds on the size of an optimal -covering has been studied in the literature for the following particular cases:

  • In the case and , the problem of finding an -covering is easily seen to be equivalent to the graph coloring problem. Then is the chromatic number of the graph . For instance, the greedy coloring bound gives .

  • The case and has been studied as the clique covering and the clique partition numbers. Hall [27] and Erdős et al. [19] showed that .

  • In the case and for , the problem of finding a -uniform -covering is equivalent to finding an -covering design. In this case, Spencer [36] showed that .

In the following, we first describe a general upper bound on the size of an optimal -covering. We then deduce from this bound an algorithm to build -coverings, and analyze its worst-case complexity. Finally, we introduce a heuristic improvement and a theoretical bound that improve the prior results for sufficiently sparse .

4.1 Our construction

The next result generalizes the greedy coloring bound to -coverings. It gives a general bound on the size of an optimal -covering in terms of the degrees of and .

Theorem 15.

Let be families of subsets satisfying condition (1), and suppose that sets in have size at most . Then there exists an -covering of size

such that for every .

Proof.

We prove this by induction on . If , then satisfies the theorem. Now let be an integer and assume that the statement holds for every pair of families of subsets satisfying and the theorem hypotheses. Let be a pair of families of subsets satisfying and the theorem hypotheses, and write for some fixed and . Then, by the induction hypothesis, there exists an -covering with and such that every is contained in at most elements of . We now build an -covering from , in such a way that and that every is contained in at most elements of .

If is contained in some , then satisfies the theorem. Otherwise, let

Note that the condition is equivalent to and . Since there are at most elements with , and since every set of the form can be contained in at most elements of (because for every by hypothesis), we have that .

Therefore, either there exists an element , in which case we take

or , in which case and we let

Algorithm 1 is a greedy algorithm to compute an -covering; it follows directly from the constructive proof of the previous theorem. The algorithm simply builds an ordered -covering by iterating through . Every set in is merged with the first available element of , i.e., with the first element such that no is contained in . If no such exists, then is added as a singleton to . Note that this algorithm is a generalization of the usual greedy coloring algorithm.

Input: ,
 1  Initialize
 2  for  do
 3      if  is not contained in any  then
 4          if there exists  such that  for every  then
 5
 6          else
 7
 8          end if
 9      end if
10  end for
Output: The -covering
Algorithm 1 Construction

To analyze the worst-case time complexity of Algorithm 1, note that the main loop (line 2) is repeated times. At step , the first if statement (line 3) requires checking at most inclusions, and the second if statement (line 4) requires checking at most inclusions. Therefore, Algorithm 1 runs in time .
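A possible Python rendering of Algorithm 1 is sketched below (ours, with illustrative data; "available" is interpreted as described above, i.e., a fragment c is available for a when no b ∈ B is contained in c ∪ a, and condition (1) is assumed to hold):

```python
def greedy_covering(A, B):
    """Greedy sketch of Algorithm 1: each a in A is merged into the first
    fragment c such that c | a contains no forbidden set b in B; if no
    such fragment exists, a opens a new fragment.  Assumes condition (1),
    i.e., no b in B is contained in any a in A."""
    C = []
    for a in A:
        a = frozenset(a)
        if any(a <= c for c in C):        # a is already covered: skip
            continue
        for i, c in enumerate(C):         # first available fragment
            if not any(frozenset(b) <= (c | a) for b in B):
                C[i] = c | a
                break
        else:                             # no fragment is available
            C.append(a)
    return C

A = [{1, 2}, {2, 3}, {4}]   # sets that must each lie inside one fragment
B = [{1, 3}]                # sets that must not lie inside any fragment
C = greedy_covering(A, B)
print(sorted(sorted(c) for c in C))  # [[1, 2, 4], [2, 3]]
```

Here {2, 3} cannot be merged with {1, 2} because their union would contain the forbidden set {1, 3}, while {4} can, so the output has two fragments.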

4.2 A heuristic improvement

In order to motivate the heuristic procedure proposed below, we must first note that the output of Algorithm 1 depends strongly on the particular order in which the elements of are processed in the main loop. In particular, we see in the following proposition that there always exists an optimal ordering of the elements of . Of course, since the problem of finding an optimal -covering is NP-hard and an optimal ordering can be verified in polynomial time, finding an optimal ordering in our case is NP-complete.
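To illustrate this order dependence, consider the following self-contained sketch (ours; the instance is the crown graph K_{3,3} minus a perfect matching, with B the edge set and A the vertex singletons, the classic example where greedy coloring is order-sensitive). One ordering yields three fragments while another yields two:

```python
def greedy_covering(A, B):
    # Same greedy construction as Algorithm 1, re-sketched here so the
    # example is self-contained.
    C = []
    for a in A:
        a = frozenset(a)
        if any(a <= c for c in C):
            continue
        for i, c in enumerate(C):
            if not any(frozenset(b) <= (c | a) for b in B):
                C[i] = c | a
                break
        else:
            C.append(a)
    return C

# Vertices 1-3 and 4-6; B holds the edges of K_{3,3} minus the matching
# (1,4), (2,5), (3,6); A asks to cover every vertex.
B = [{1, 5}, {1, 6}, {2, 4}, {2, 6}, {3, 4}, {3, 5}]
bad  = [{1}, {4}, {2}, {5}, {3}, {6}]   # alternating sides: 3 fragments
good = [{1}, {2}, {3}, {4}, {5}, {6}]   # side by side: 2 fragments
print(len(greedy_covering(bad, B)), len(greedy_covering(good, B)))  # 3 2
```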

Proposition 16.

Let be families of subsets satisfying condition (1). Then there exists an ordering of such that Algorithm 1 outputs an optimal -covering.

Proof.

Let be an optimal -covering. For every , define to be the family of elements of that are contained in and that are not contained in any for ,

We first prove that defines a partition of .

Indeed, , because otherwise would be an -covering smaller than .

Also, for every . Otherwise, if with , then implies , and implies , a contradiction.

Finally, since every is contained in some element of by the definition of -covering, we can take with minimal index among those that contain . Then by definition, and therefore .

Now, define a new ordering of by taking the sets in sequentially. That is, if , define

Consider the behavior of Algorithm 1 on input . It is easy to see that, when the algorithm finishes processing the sets in , the local variable holds at most elements. Therefore, since the covering is optimal, Algorithm 1 outputs an optimal -covering. ∎

Following this result, we propose a heuristic procedure to build an ordering of , inspired by the Welsh-Powell algorithm [38]. This procedure can be deduced from the proof of the following proposition, which effectively reduces the upper bound given in Theorem 15 for sufficiently sparse .

Proposition 17.

Assume the hypotheses of Theorem 15. Then there exists an -covering of size

such that for every .

Proof.

First reorder so that satisfies

Now, consider the behavior of Algorithm 1 on input and the reordered . At step , Algorithm 1 processes . In this step, there can be at most sets such that does not satisfy the condition in line (that is, such that there exists with ). To see this, note that by definition at most elements intersect , and that each set of the form can be contained in at most elements of .

Now, at step the number of elements checked in the condition of line is at most . Since at step the family has at most sets, at most elements of are checked until either line or is executed, and line can add an additional element to . Hence, by iterating through all elements of , the size of the final output can be at most . ∎

We now state our heuristic improvement of Algorithm 1, which follows directly from the previous proof.

Input: ,
1 for  do
2       Compute
3 end for
Sort so that
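The ordering step can be sketched as follows. The sort key below is a hypothetical "conflict degree" of our own devising (the precise quantity used in the heuristic is not reproduced above): in the Welsh-Powell spirit, the most constrained sets are processed first.

```python
def conflict_degree(a, A, B):
    """Hypothetical ordering key: the number of other sets a' in A that
    cannot share a fragment with a, because a | a' would contain some
    forbidden set b in B."""
    a = set(a)
    return sum(1 for ap in A if set(ap) != a and
               any(set(b) <= (a | set(ap)) for b in B))

A = [{1}, {2}, {3}, {4}]
B = [{1, 2}, {1, 3}, {1, 4}]
# Process the most constrained sets first, then run Algorithm 1 on A_sorted.
A_sorted = sorted(A, key=lambda a: conflict_degree(a, A, B), reverse=True)
print(A_sorted[0])  # {1}, since it conflicts with every other set
```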