
Topology of Privacy: Lattice Structures and Information Bubbles for Inference and Obfuscation

12/12/2017
by   Michael Erdmann, et al.

Information has intrinsic geometric and topological structure, arising from relative relationships beyond absolute values or types. For instance, the fact that two people share a meal describes a relationship independent of the meal's ingredients. Multiple such relationships give rise to relations and their lattices. Lattices have topology. That topology informs the ways in which information may be observed, hidden, inferred, and dissembled. Dowker's Theorem establishes a homotopy equivalence between two simplicial complexes derived from a relation. From a privacy perspective, one complex describes individuals with common attributes, the other describes attributes shared by individuals. The homotopy equivalence produces a lattice. An element in the lattice consists of two components, one being a set of individuals, the other being a set of attributes. The lattice operations join and meet each amount to set intersection in one component and set union followed by a potentially privacy-puncturing inference in the other component. Privacy loss appears as simplicial collapse of free faces. Such collapse is local, but the property of fully preserving both attribute and association privacy requires a global condition: a particular kind of spherical hole. By looking at the link of an identifiable individual in its encompassing Dowker complex, one can characterize that individual's attribute privacy via another sphere condition. Even when long-term attribute privacy is impossible, homology provides lower bounds on how an individual may defer identification, when that individual has control over how to reveal attributes. Intuitively, the idea is to first reveal information that could otherwise be inferred. This last result highlights privacy as a dynamic process. Privacy loss may be cast as gradient flow. Harmonic flow for privacy preservation may be fertile ground for future research.


1 Introduction

Privacy is the ability of an individual or entity to control how much that individual or entity reveals about itself to others. Fundamental research into privacy seeks to understand the limits of that ability.

A brief history of privacy should include the following:

  • The right to privacy as a legal principle, appearing in an 1890 Harvard Law Review article [24]. The article was a reaction to the then modern technology of photography and the dissemination of gossip via print media.

  • A demonstration linking supposedly anonymous public information with other more specific public data, thereby revealing sensitive attributes [21]. The demonstration employed zip code, gender, and birth date to link anonymous public insurance summaries with voter registration data. Doing so produced the health record of the governor of Massachusetts. This privacy failure suggested a first form of homogenization, called $k$-anonymity. Roughly, the idea was to structure databases in such a way that a database could respond to any query with an answer consisting of no fewer than $k$ individuals matching the query parameters.

  • The discovery that it is impossible to preserve the privacy of an individual for even a single attribute in the face of repeated statistical queries over a population [2], unless answers to those queries are purposefully perturbed with noise of magnitude on the order of at least $\sqrt{n}$. Here $n$ is the size of the population. The significance of this discovery is to underscore how difficult it is to preserve privacy while retaining information utility.

  • Netflix Prize. In 2006, Netflix offered a $1M prize for an algorithm that would predict viewer preferences better than Netflix’s internal algorithm. Netflix made available some of its historical user preferences, in anonymized form, as a basis for the competition. Once again, it turned out that one could link this anonymized data with other publicly available databases, resulting in the potential (and in some cases actual) identification of Netflix viewers, thereby de-anonymizing their viewing history [17]. Whereas in the earlier health example, a few specific observables made linking possible (global coordinates, one might say, namely zip code, gender, birth date), in the Netflix example, the intrinsic geometric structure of the database facilitated linking via a wide variety of observables (local landmarks, one might say, namely movies that were characteristic for each individual). Key was sparsity of information: 8 movie ratings and dates were generally enough to uniquely characterize 99% of viewers in the Netflix Prize dataset, even with errors in the ratings and dates.

  • Differential Privacy [5, 4] seeks to avoid the previous privacy failures by focusing on local rather than absolute privacy guarantees. The underlying approach in differential privacy is for a database to answer statistical queries with a particular stochastic blurring. Specifically, the probability that an interrogator of the database will make any particular inference should depend only in a very small way on whether any one individual does or does not have a particular attribute (such as even being in the database). We might call this stochastic homogeneity.

  • Randomized Response. Differential privacy is further significant because it makes explicit the dynamic nature of privacy; there may be no enduring privacy guarantees but there are differential guarantees. A particular form is randomized response, a technique used in the social sciences to elicit reliable aggregate answers to sensitive questions, asking the question of many people, but perturbing individual answers stochastically so as not to learn much about any one individual from any single response [23]. A version has been employed by Google to find malware [8].
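The randomized response idea in the last bullet can be sketched in a few lines of Python. This is a minimal illustration of one common coin-flip variant, not necessarily the design used in [23] or [8]; the function names, the 50/50 coin probabilities, and the synthetic population are all assumptions made for the example.

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Answer truthfully on a first coin flip; otherwise report a second fair coin."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_fraction(responses) -> float:
    """Invert P(yes) = 0.5 * p + 0.25 to recover the population fraction p."""
    observed_yes = sum(responses) / len(responses)
    return 2.0 * observed_yes - 0.5


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical population: 30% hold the sensitive attribute.
population = [i < 30_000 for i in range(100_000)]
responses = [randomized_response(t, rng) for t in population]
estimate = estimate_true_fraction(responses)
```

Any single response says little about the respondent, since a "yes" may come from either coin, yet the aggregate estimate stays close to the true fraction.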

Privacy has both a combinatorial component and a statistical component. Prior research has largely focused on statistical techniques, both to preserve privacy and to puncture privacy. One of the goals of this research is to understand the combinatorial component of privacy, leading naturally to methods from combinatorial topology.

A desire to understand the geometry and topology of the types of inferences revealed by the Netflix Prize formed the specific motivation for our research initially. Subsequently, we realized that the lattice structure found in that geometry had broader applicability, providing an ability to model the dynamics of privacy more generally.

2 Outline

The remaining sections and appendices present the following material:

Main Narrative:

3:

Toy examples illustrating how a relation may lead to privacy loss in the presence of background information. The section introduces the doubly-labeled poset associated with a relation, to model such inferences. The elements of the poset are ordered pairs, each a set of individuals and a set of attributes.

This section also states and discusses assumptions that hold throughout the report.

4:

Formal description of the Galois connection associated with a relation. The section first defines, for any relation, two simplicial complexes called Dowker complexes. One complex represents sets of individuals with shared attributes, the other represents sets of attributes shared by individuals. The Galois connection then establishes a homotopy equivalence between the Dowker complexes, thereby generating the relation’s doubly-labeled poset. The homotopy equivalence gives rise to closure operators, with “closure” in the poset modeling inference of unobserved attributes from observed attributes (or unobserved individuals from observed individuals).

This section also defines attribute privacy  and association privacy.

5:

A characterization of privacy in terms of the absence of free faces in the relevant Dowker complex. This section observes as well that the only connected relations able to preserve both attribute and association privacy must look either like linear cycles or like boundary complexes. In particular, the number of individuals and attributes must be the same.

6:

Conditional relations, as models for simplicial links. A conditional relation is much like a conditional probability distribution. It might, for instance, represent the possible arrangement of remaining attributes among individuals, after some attributes have already been observed.

7:

A characterization of individual and group attribute privacy in terms of spherical and boundary complexes for the relation that models the individual’s or group’s link in its Dowker complex.

8:

A brief exploration of holes in relations, focusing on attribute spaces generated by bits.

9:

A small example exploring the possibility of increasing privacy by change-of-coordinate transformations.

10:

A lengthy exploration of how someone can delay identification, by releasing attributes selectively in a particular order. This idea leads to the notion of informative attribute release sequences, how to find such sequences in the Galois lattice, and the use of homology as a lower bound for the number and length of such sequences.

11:

Computation of the homology and maximal informative attribute release sequences present in two relations found on the world wide web. One relation describes Olympic athletes and their medals, the other describes jazz musicians and their bands.

12:

A more general perspective of inference as motion in lattices, not necessarily directly derived from a relation. This perspective suggests connections to randomized response techniques.

13:

An examination of the ability to obfuscate strategies and/or goals in graphs where motions may be nondeterministic or stochastic.

14:

A possible category for representing relations, along with an analysis of morphism properties. The morphisms between relations in this category induce simplicial and therefore continuous maps between the relations’ corresponding Dowker complexes.

This section further shows by example how a morphism of relations, when it is surjective at the set level, generates the full lattice of the codomain’s relation, via closure under lattice operations.  (A general proof appears in Appendix I.)

15:

Some thoughts for the future, including an example that connects stochastic sensing to the Galois lattice.

Appendices:

A:

A summary of the basic notation and definitions used in this report.

B:

A summary of the basic tools used in this report, establishing the homotopy equivalences and closure operators mentioned previously.

C:

Construction of links and deletions, and examination of the privacy properties each inherits from its encompassing relation. This appendix explores the significance of free faces in the Dowker complexes. The appendix further proves that a relation with more attributes than individuals cannot preserve attribute privacy for every individual.

D:

Proof that the problem of finding a minimal set of attributes from which another attribute may be inferred is NP-complete. This stands in contrast to the observation that the problem of finding some set of attributes from which another may be inferred (or reporting that no such set exists) is computable in polynomial time.

E:

Detailed proofs of the results claimed in Section 7. Also a detailed proof of the assertion from Section 5 regarding relations that preserve both attribute and association privacy.

F:

Detailed proofs of the connection between maximal chains in a relation’s Galois lattice and informative attribute release sequences. When such sequences are order-independent they correspond to spherical holes, leading to the concept of an isotropic  sequence.

G:

Detailed proof that homology establishes a lower bound for the number and length of maximal chains in a relation’s Galois lattice, and thus for the number and length of informative attribute release sequences that may be used to delay identification.

H:

An application of the previous results with the aim of obfuscating the identification of strategies for attaining goals in graphs with uncertain transitions.

I:

Detailed proofs of the assertions of Section 14 regarding morphisms.

J:

Some additional examples:

  1. Dunce Hat: modeled as a relation for which the Dowker attribute complex is contractible but has no free attribute faces, meaning the relation preserves attribute privacy.

  2. Disinformation: An example that glues together two copies of the Möbius strip, thereby removing free faces and creating a form of homogeneity that preserves attribute privacy yet retains the utility of identifiability.

  3. Insufficient Representation: If there are insufficiently many individuals in a relation generated by bits, attribute inference is possible.

  4. A Matching Example: When many individuals are being observed, cardinality constraints allow for inferences beyond those discussed in this report.

List of Primary Symbols


Symbol Typical Meaning Page(s)
discrete space of individuals 1, A.4
discrete space of attributes 1, A.4
relation on 1, A.4
individuals with attribute (usually in the context of relation ) 1, A.4
attributes of individual (usually in the context of relation ) 1, A.4
another relation, often representing a link in a simplicial complex 7, 55, 19
generic simplicial complexes  (sometimes merely sets) A.1
complex; simplices are sets of individuals with a common attribute 1, A.4
complex; simplices are sets of attributes shared by some individual 1, A.4
usually a simplex representing individuals in
usually a simplex representing attributes in
homotopy equivalence from sets of individuals to shared attributes 2, A.4
homotopy equivalence from sets of attributes to sharing individuals 2, A.4
partially ordered set (poset) A.2
face poset of the simplicial complex 2, A.2
order complex of the poset 4.2, A.2
doubly-labeled poset associated with relation 3.3, 3, A.4
(inference) lattice (29) A.3
Galois lattice formed from 13, A.4
chain of length in the lattice 21, F, A.2
informative attribute release sequence (iars) of length (for relation ) 14
set of vertices in a simplicial complex or states in a graph
simplicial boundary complex with vertices 5.3, 1
sphere of dimension , modeling the empty complex A.1
circle 5.3
sphere of dimension 5.3, 1
group of simplicial -chains over , with integer coefficients A.1
(family of) reduced boundary map(s)   2
reduced -dimensional homology group of , with integer coefficients A.1
a graph, generally with nondeterministic and/or stochastic actions 13.1, 13.2
strategy complex of a graph 45, 13.2
source complex of a graph H.1
homotopy equivalence A.1
simplicial join A.1
either topological wedge sum or lattice join A.1, A.3
lattice meet A.3

3 Privacy: Relations and Partially Ordered Sets

Our investigation of privacy in this report will be in terms of relations. As we will see in this section and the next, relations give rise to simplicial complexes, which give rise to partially ordered sets, which expose an underlying lattice structure. That lattice structure makes explicit how privacy may be preserved or lost through so-called background knowledge. As we will see in Section 10, the lattice structure also makes explicit how identification may be delayed by careful release of information.

3.1 A Toy Example: Health Data and Attribute Privacy

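Consider the following relation, describing the results of a hypothetical health study for four patients and three attributes. The patients have been anonymized and are represented simply by the set of numbers {1, 2, 3, 4}. The three attributes are drawn from the set {smokes, drinks_soda, has_cancer}.

One can describe a relation equivalently either as a matrix or as a set of ordered pairs:

Relation as a matrix (rows are patients, columns are attributes, 1 marks an entry):

       smokes   drinks_soda   has_cancer
  1      1          0             1
  2      0          1             1
  3      0          1             0
  4      0          1             0

Relation as a set of ordered pairs:

  {(1, smokes), (1, has_cancer), (2, drinks_soda), (2, has_cancer), (3, drinks_soda), (4, drinks_soda)}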
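The equivalence of the two encodings can be made concrete with a small Python sketch. The entries below are reconstructed from the prose of this section (who smokes, who drinks soda, who has cancer); the identifiers `smokes`, `drinks_soda`, `has_cancer` and the helper names are assumptions made for illustration.

```python
individuals = [1, 2, 3, 4]
attributes = ["smokes", "drinks_soda", "has_cancer"]

# Relation as a set of ordered pairs (individual, attribute).
pairs = {
    (1, "smokes"), (1, "has_cancer"),
    (2, "drinks_soda"), (2, "has_cancer"),
    (3, "drinks_soda"),
    (4, "drinks_soda"),
}


def to_matrix(pairs, rows, cols):
    """0/1 matrix with individuals indexing rows and attributes indexing columns."""
    return [[1 if (x, y) in pairs else 0 for y in cols] for x in rows]


def to_pairs(matrix, rows, cols):
    """Recover the set of ordered pairs from the 0/1 matrix."""
    return {
        (x, y)
        for i, x in enumerate(rows)
        for j, y in enumerate(cols)
        if matrix[i][j]
    }


matrix = to_matrix(pairs, individuals, attributes)
round_trip = to_pairs(matrix, individuals, attributes)
```

The round trip returns the original set of pairs, so either form may be used interchangeably.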

Assumptions

Before discussing privacy further, we make some assumptions that hold throughout the report:

Assumption of Relational Completeness:

We assume that any given relation is not missing any observable elements, relative to some external (unspecified) ground truth.

For example, if we observe that someone in the health relation drinks soda and has cancer, then we would conclude that we are observing individual #2. We would be surprised to see that individual smoke. If for some reason we ever do see the individual smoke, then we would deem our observations to be inconsistent with the relation. The meaning of inconsistency depends on context. At top level, an inconsistency may mean that the relation or observation is erroneous. When making conditional observations, an inconsistency may actually supply useful information, as we will see in Lemma 12.

Comment: A relation may contain extra elements, as may be useful for disinformation. A relation could even be missing elements that represent valid ordered pairs, so long as those elements are deemed to be unobservable for that relation. For example, one may have a time series of relations in which some attributes only become observable at later times. In such a setting, one may never know whether a particular individual had a particular attribute at an earlier time.

In the example, it could be that individual #1 drinks soda, but that it is impossible to observe this fact. In that case, the relation would still satisfy the assumption of relational completeness, even though it contains no entry indicating that individual #1 drinks soda. (Terminology: We often use the term 'entry' to mean an element of a relation, as in a matrix, or in one of its rows or columns.)

Assumption of Observational Monotonicity:

Even though we assume relations are complete, we do not assume that observations are complete. Instead, we assume: The observation of a particular attribute for an individual is meaningful; lack of such an observation does not necessarily imply that the individual fails to have the unobserved attribute. The motivation for this assumption is that one may yet discover that the individual has the attribute. For example, suppose we observe someone (whom we know to be part of the health relation) drinking soda. Even if that is all we observe, we do not conclude that the individual is cancer free. It could be that we might yet observe the individual to have cancer.

If absence of an attribute is significant and  that absence is observable, then both the attribute and its negation could and perhaps should appear explicitly in the relation as distinct mutually exclusive attributes. For instance, Prime versus Composite might be such a pair of attributes for integers greater than 1.

Assumption of Observational Accuracy:

We assume that observations are accurate. For instance, if we observe an integer to be either Prime or Composite, then we do so correctly.

Comments:

The three assumptions above are desiderata for how the mathematical abstractions of this report fit into the real world.  Some comments are in order:

  • In and of itself, a relation defines a particular kind of world, a bipartite graph, and there is no external ground truth.

  • In such a world, the completeness, monotonicity, and accuracy assumptions describe a sensor and the meaning of observations made by the sensor.

The purpose of the assumptions in the real world is largely to ensure consistency between different relations and with possible observations.

  • The monotonicity assumption is important because information generally aggregates asynchronously. Together with the other assumptions, this assumption means that one may view relations as monotone Boolean functions, and thus may leverage methods from combinatorial topology.

  • One may incorporate some errors into the relational and observational models, for instance by blurring a relation. For very large integers, a relation might allow some integers to have both  Prime and Composite as attributes. Although an integer is one or the other, the relation admits to uncertainty by allowing both attributes at once. Indeed, some relations purposefully introduce such blurring to preserve privacy, as with randomized response [23]. In robotics, natural relational blurring arising from noisy but environment-compatible sensors can actually help establish the topology of a region, for instance by dualizing sensors and landmarks [11].

Privacy Implications

Making the health study of Section 3.1 publicly available has some privacy implications, including the following:

  • Suppose someone named Bob tells his friend Alice that he was part of the study. Alice knows that Bob smokes everywhere he goes, so she can infer that he is Patient #1 and has cancer. (This is an example of inference in a relation using background knowledge.)

  • Suppose Cindy is Patient #2. She has full attribute privacy as far as the health relation is concerned. In particular, as we saw already, Cindy can tell her friends that she was part of the health study while drinking soda and those friends will not be able to conclude that she has cancer.

  • Patients #3 and #4 are not only indistinguishable from each other but also from Cindy (patient #2), as long as only soda drinking is observed. This is a very strong form of anonymity. Even if one of them reveals that s/he drinks soda, s/he will remain indistinguishable from the other two patients who drink soda.

Caveat: In the last case, if Cindy reveals that she has cancer and is seen to be different from the other individuals, then one may be able to remove her from the relation, narrowing the focus and creating a new relation that may allow additional inferences. Similar caveats hold for the other bullets. Deletions are discussed further in Appendix C.
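The inferences in the bullets above can be reproduced mechanically. The following Python sketch assumes the health relation as reconstructed from the prose; the attribute identifiers and the function names `witnesses` and `inferable` are illustrative, not taken from the report.

```python
health = {
    (1, "smokes"), (1, "has_cancer"),
    (2, "drinks_soda"), (2, "has_cancer"),
    (3, "drinks_soda"),
    (4, "drinks_soda"),
}


def witnesses(relation, observed):
    """Individuals that have at least all the observed attributes."""
    people = {x for x, _ in relation}  # individuals with blank rows are not listed
    return {x for x in people if all((x, y) in relation for y in observed)}


def inferable(relation, observed):
    """Attributes shared by every witness; None if no individual matches."""
    matching = witnesses(relation, observed)
    if not matching:
        return None
    rows = [{y for x2, y in relation if x2 == x} for x in matching]
    return set.intersection(*rows)


bob = inferable(health, {"smokes"})         # Alice's background knowledge about Bob
cindy = inferable(health, {"drinks_soda"})  # what Cindy's friends can conclude
```

Observing `smokes` pins down a single witness and yields `has_cancer` as well, whereas observing `drinks_soda` leaves three candidate patients and yields nothing new.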

Modifying a Relation to Increase Privacy

We can make a small change in the health relation that enhances privacy. If we artificially give patient #3 the attribute smokes, then we obtain a modified relation that differs from the original only in the added entry (3, smokes).

Now Bob may reveal to Alice that he was part of the health study without Alice being able to infer that he has cancer, even though she knows that everyone knows that he smokes. In fact, more generally, one can no longer infer cancer from smoking, within the relation.

Such an artificial entry in the relation is a form of disinformation. It certainly skews statistics and utility. It also increases privacy.
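A short Python sketch can confirm the effect of the artificial entry. It assumes the health relation as reconstructed from the prose, with illustrative attribute identifiers; `closure` is a hypothetical helper name for "all attributes shared by everyone who has the observed ones".

```python
def closure(relation, observed):
    """Attributes shared by every individual having all observed attributes."""
    people = {x for x, _ in relation}
    matching = [x for x in people if all((x, y) in relation for y in observed)]
    if not matching:
        return None
    return set.intersection(
        *({y for x2, y in relation if x2 == x} for x in matching)
    )


health = {
    (1, "smokes"), (1, "has_cancer"),
    (2, "drinks_soda"), (2, "has_cancer"),
    (3, "drinks_soda"),
    (4, "drinks_soda"),
}
# Disinformation: artificially give patient #3 the attribute smokes.
health_modified = health | {(3, "smokes")}

before = closure(health, {"smokes"})
after = closure(health_modified, {"smokes"})
```

Before the change, `smokes` implies `has_cancer`; afterwards the closure of `smokes` is just `smokes`, because patient #3 now also matches and does not have cancer in the relation.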

3.2 A Dual Perspective: Payroll Data and Association Privacy

The previous example examined a relation from the perspective of attribute privacy: we were interested in understanding how observation of some attribute(s) implied other attribute(s), possibly identifying an individual. A dual perspective is association privacy, in which one seeks to understand how some associations between individuals imply others.

Consider a hypothetical “salary” relation that has the same matrix structure as the health relation did earlier, but with different semantics. This relation represents employees working on secret projects. Now the employee names are visible so that a payroll clerk can disburse salaries correctly, but the actual projects are anonymous.

The salary relation has some implications for association privacy, including the following:

  • If someone tells the payroll clerk that Julie is the lead of a very important project with valuable information, then the payroll clerk can infer that Mary and Frank have also been exposed to valuable information.

  • In contrast, if someone tells the payroll clerk that Bob is running a very important project, then the payroll clerk does not have enough information to conclude that Mary is also working on an important project.
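The payroll clerk's reasoning in the two bullets above is the dual of attribute inference and can be sketched in Python. The project labels `p1`, `p2`, `p3` and the assignment of Bob, Mary, Frank, and Julie to rows 1 through 4 of the earlier matrix are assumptions reconstructed from the prose, as is the helper name `implied_colleagues`.

```python
payroll = {
    ("Bob", "p1"), ("Bob", "p3"),
    ("Mary", "p2"), ("Mary", "p3"),
    ("Frank", "p2"),
    ("Julie", "p2"),
}


def implied_colleagues(relation, seen):
    """Employees who work on every project shared by all employees in `seen`."""
    employees = {e for e, _ in relation}
    projects = {p for _, p in relation}
    shared = {p for p in projects if all((e, p) in relation for e in seen)}
    if not shared:
        return None
    return {e for e in employees if all((e, p) in relation for p in shared)}


from_julie = implied_colleagues(payroll, {"Julie"})
from_bob = implied_colleagues(payroll, {"Bob"})
```

Julie works on a single project, so everyone on that project is implied. Bob works on two projects, one of them alone, so learning that Bob runs an important project implies nobody else.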

Regarding disinformation: Observe how adding the artificial entry prevents the payroll clerk from using the relation to infer that Mary and Frank have valuable information, even if the payroll clerk learns via background information that Julie is the lead of a very important project with such information:

3.3 Privacy Preservation and Loss: A Poset Model

Figure 1: Relation $R$ serves as a model for the two examples of Sections 3.1 and 3.2. The doubly-labeled poset $P_R$ describes the inferences facilitated by $R$.

Figure 1 shows a relation $R$ that serves as a model for both the health example of Section 3.1 and the payroll example of Section 3.2. The relation is identical to those given earlier, but with abstract labels in place of both individuals and attributes. The figure also depicts a partially ordered set (poset) $P_R$, designed to model the inferences discussed previously. We refer to that poset as the doubly-labeled poset associated with $R$. We next discuss the semantics of $P_R$. Section 4 discusses the construction of $P_R$. The underlying concepts are important throughout the report.

Semantics of the poset $P_R$ associated with the relation $R$ of Figure 1:

  • Each element in the poset consists of an ordered pair $(\sigma, \gamma)$, with $\sigma$ describing a set of individuals and $\gamma$ describing a set of attributes. We say that the poset element is labeled with $\sigma$ and $\gamma$. The meaning of such a double-labeling (with respect to the information described by relation $R$) is:

    1. All individuals in $\sigma$ have all attributes in $\gamma$.

    2. If (and only if) an individual has at least all the attributes in $\gamma$, then that individual must be in $\sigma$. For example, in the health reading of $R$, individual #2, and only individual #2, both drinks soda and has cancer.

    3. If (and only if) an attribute is shared by at least all individuals in $\sigma$, then that attribute must be in $\gamma$. For example, in the health reading of $R$, individual #1 both smokes and has cancer, so when $\sigma = \{1\}$ the label $\gamma$ cannot contain simply the smoking attribute, but must contain both attributes.

  • The partial order for $P_R$ is described by the edges in the figure. There is an edge between two elements $(\sigma_1, \gamma_1)$ and $(\sigma_2, \gamma_2)$ of $P_R$ whenever the corresponding sets are subset comparable. In particular, $(\sigma_1, \gamma_1) \leq (\sigma_2, \gamma_2)$ in $P_R$ precisely when $\sigma_1 \subseteq \sigma_2$ and $\gamma_1 \supseteq \gamma_2$. [Observe that the comparability ($\subseteq$ versus $\supseteq$) is opposite for $\sigma$ versus $\gamma$.]

Using the poset for attribute inference:

Suppose $\gamma$ is any nonempty subset of the attributes. Then one of (i) or (ii) holds:

  1. Perhaps no individual modeled by the relation has all the attributes in $\gamma$. For example, in the health reading, no individual both smokes and drinks soda. We would not expect to see that combination and so it does not appear in the poset.

  2. Alternatively, $\gamma$ is a subset of at least one set of attributes that does appear in the poset. In this case, one may be able to enlarge $\gamma$ nontrivially, resulting in privacy loss.

    For example, imagine we discover that a friend with the smoking attribute is modeled by the given relation (e.g., Bob, who smokes, says he is part of the health study).
    Using $\gamma$ equal to the singleton set containing the smoking attribute, the poset then allows us to infer that Bob must also have the second attribute of individual #1 (that is, has_cancer). Why? Because the set consisting of both attributes is a minimal set in the poset containing $\gamma$.

    We can say yet more: The element labeled with that pair of attributes is also labeled with $\{1\}$. So now we have de-anonymized individual #1 (identifying him to be Bob).

    Regardless of whether Bob ever actually talks to us, the poset tells us that individual #1 could suffer privacy loss, and in fact, is uniquely identifiable in the context of the relation without needing to reveal everything about himself.

Similar reasoning is possible for association inference, as we saw earlier.
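The doubly-labeled poset can be enumerated directly from a relation. The Python sketch below does so for the toy relation, using health-study attribute names in place of the figure's abstract labels; those names, and the helper names `phi`, `psi`, and `doubly_labeled_poset`, are assumptions made for illustration. Each element is obtained by taking a set of attributes that some individual possesses, collecting everyone who has them, and then collecting everything those individuals share.

```python
from itertools import combinations

R = {
    (1, "smokes"), (1, "has_cancer"),
    (2, "drinks_soda"), (2, "has_cancer"),
    (3, "drinks_soda"),
    (4, "drinks_soda"),
}
X = sorted({x for x, _ in R})
Y = sorted({y for _, y in R})


def psi(gamma):
    """Individuals that have every attribute in gamma."""
    return frozenset(x for x in X if all((x, y) in R for y in gamma))


def phi(sigma):
    """Attributes shared by every individual in sigma."""
    return frozenset(y for y in Y if all((x, y) in R for x in sigma))


def doubly_labeled_poset():
    """All pairs (sigma, gamma) that are closed under the two maps."""
    elements = set()
    for k in range(1, len(Y) + 1):
        for gamma in combinations(Y, k):
            sigma = psi(gamma)
            if sigma:  # gamma has at least one witness
                elements.add((sigma, phi(sigma)))
    return elements


poset = doubly_labeled_poset()
```

For this relation the enumeration yields four elements, including the pair labeled with individual #1 and both of that individual's attributes, which is the element responsible for the inference about Bob.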

Figure 2: A relation, along with its doubly-labeled poset. The relation preserves attribute privacy but allows a small amount of association inference: If one sees individual #4 in some context, then one can infer that individuals #2 and #3 are also present in that same context, without needing to observe them directly.

Disinformation Revisited:

Figure 2 shows a second relation, constructed from the relation of Figure 1 by adding an entry of disinformation, much as we modified the health relation earlier. The figure also shows the corresponding doubly-labeled poset. Observe that, in the health reading, it is no longer possible to infer cancer from smoking, because the singleton set containing the smoking attribute now appears directly as an attribute label in the poset. The added entry has increased attribute privacy compared to the relation of Figure 1.

There is, however, still some opportunity for making association inferences. For instance, knowing that individual #4 (Julie, earlier) works on an important secret project still allows the inference that individuals #2 and #3 have valuable information. That is because the minimal set containing $\{4\}$ in the poset is $\{2, 3, 4\}$. Notice that no such association inference is possible if someone says that individual #3 works on an important secret project, though that would have been possible in the original relation of Figure 1.

Comment:

Artificial entries can potentially also produce inferences of disinformation. For instance, if, in our earlier health relation, the entry recording that patient #1 has cancer is artificial, then inferring that Bob has cancer from his smoking, when in fact Bob is healthy, would be disinformation.

4 The Galois Connection for Modeling Privacy

Section 3 showed by example how a relation determines a partially ordered set (poset) useful for modeling privacy. The elements in the poset are ordered pairs — a set of attributes and a set of individuals — that are equivalent from the relation’s perspective. Privacy loss occurs when an observer has data (for example, background knowledge) that is not directly in the poset but is a proper subset of some set of attributes or individuals in the poset. The observer may then infer some additional attributes or individuals. This section develops the connection between relations and posets more precisely, continuing to use the earlier examples for illustration. See also Appendices A and B for notation and additional material.

4.1 Dowker Complexes

Definition 1 (Dowker Complexes).

Let $X$ and $Y$ be finite discrete spaces and let $R$ be a relation on $X \times Y$. This means $R$ is a set of ordered pairs $(x, y)$, with $x \in X$ and $y \in Y$. We frequently view/depict $R$ as a matrix of $0$s and $1$s, or as a matrix of blank and nonblank entries, with $X$ indexing rows and $Y$ indexing columns.

  • We often refer to elements of $X$ as individuals and to elements of $Y$ as attributes.

  • For each $x \in X$, let $Y_x = \{y \in Y \mid (x, y) \in R\}$. Then $Y_x$ consists of all attributes of individual $x$. We may view $Y_x$ as a row of $R$. We say that the row is blank if $Y_x = \emptyset$.

  • For each $y \in Y$, let $X_y = \{x \in X \mid (x, y) \in R\}$. Then $X_y$ consists of all individuals who have attribute $y$. We may view $X_y$ as a column of $R$. The column is blank if $X_y = \emptyset$.

  • We next define two simplicial complexes $\Phi_R$ and $\Psi_R$ (with some special cases below):

    $\Phi_R = \{\gamma \subseteq Y_x \mid x \in X\}$ and $\Psi_R = \{\sigma \subseteq X_y \mid y \in Y\}$.

    Special cases: If $X = \emptyset$ and/or $Y = \emptyset$, then we say the relation is void. In this case, with some exceptions discussed later (see Section 6, Section 10, and Appendix C), we let $\Phi_R$ and $\Psi_R$ each be an instance of the void complex, containing no simplices. Otherwise, with $X$ and $Y$ both nonempty, each of $\Phi_R$ and $\Psi_R$ contains at least the empty simplex $\emptyset$.

    We refer to $\Phi_R$ and $\Psi_R$ as Dowker complexes, after the author of upcoming Theorem 2.
    We say that each complex is the Dowker dual of the other, with respect to relation $R$.

    Interpretation: A nonempty set of attributes $\gamma$ is a simplex in $\Phi_R$ precisely when at least one individual has at least all the attributes in $\gamma$. We refer to any such individual as a witness for $\gamma$.

    Similarly, a nonempty set of individuals $\sigma$ is a simplex in $\Psi_R$ precisely when there is at least one attribute that is shared by at least all the individuals in $\sigma$. We refer to any such attribute as a witness for $\sigma$.

Figure 3 shows the Dowker complexes for the relation $R$ of Section 3.3.

Figure 3: Dowker simplicial complexes $\Phi_R$ and $\Psi_R$ determined by relation $R$.

Dowker’s Theorem [3, 1] says that the two simplicial complexes $\Phi_R$ and $\Psi_R$ have the same homotopy type. As we will see, the maps establishing that homotopy equivalence define the doubly-labeled poset $P_R$ and describe how privacy may be lost.

Theorem 2 (Dowker Duality [3]).

Suppose $R$ is a relation on $X \times Y$. Let $\Phi_R$ and $\Psi_R$ be as in Definition 1. Then $\Phi_R$ and $\Psi_R$ are homotopy equivalent.

Every nonvoid simplicial complex $\Sigma$ determines a partially ordered set $\mathfrak{F}(\Sigma)$ called the face poset of $\Sigma$. The elements of this poset are the nonempty simplices of $\Sigma$, partially ordered by set inclusion.  (Recall that 'poset' is short for 'partially ordered set'.)

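For the finite setting, the homotopy equivalence of Dowker’s Theorem may be seen by explicit formulas for maps between the face posets of the two Dowker complexes. These maps describe what is known as a Galois connection. [This construction also appears as a core tool within the field of Formal Concept Analysis [25, 10].]  Here are the formulas:

$$\phi_R : \mathfrak{F}(\Phi_R) \to \mathfrak{F}(\Psi_R), \quad \sigma \mapsto \bigcap_{x \in \sigma} Y_x, \qquad\qquad \psi_R : \mathfrak{F}(\Psi_R) \to \mathfrak{F}(\Phi_R), \quad \gamma \mapsto \bigcap_{y \in \gamma} X_y.$$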

These two maps are inverse homotopy equivalences. One sees this by considering the maps $\phi_R \circ \psi_R$ and $\psi_R \circ \phi_R$. These compositions turn out to be what are called closure operators on the face posets $\mathfrak{F}(\Psi_R)$ and $\mathfrak{F}(\Phi_R)$, respectively, implying that each is homotopic to an identity map, thereby establishing the desired homotopy equivalence. See Appendix B for detailed computations; see the next subsection for interpretation.
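A minimal sketch of the two maps and their compositions, assuming the same hypothetical toy relation as a dictionary from individuals to rows. The function names `phi`, `psi`, `close_attributes`, and `close_individuals` are illustrative choices, not notation fixed by the text.

```python
# Hypothetical toy relation: individual -> row of attributes.
R = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c"}, 4: {"c"}}
ATTRIBUTES = frozenset().union(*R.values())

def phi(sigma):
    """Attributes shared by every individual in sigma.

    For empty sigma this returns all attributes (an empty intersection).
    """
    shared = set(ATTRIBUTES)
    for x in sigma:
        shared &= R[x]
    return frozenset(shared)

def psi(gamma):
    """Individuals who have every attribute in gamma.

    For empty gamma this returns all individuals.
    """
    return frozenset(x for x, row in R.items() if set(gamma) <= row)

def close_attributes(gamma):
    """Composition phi after psi: a closure on sets of attributes."""
    return phi(psi(gamma))

def close_individuals(sigma):
    """Composition psi after phi: a closure on sets of individuals."""
    return psi(phi(sigma))

print(sorted(close_attributes({"a"})))   # attributes inferable from a
print(sorted(close_individuals({3})))    # individuals inferable from #3
```

Each composition is increasing and idempotent on the sets it is applied to, which is what makes it a closure operator in the sense of the next subsection.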

4.2 Inference from Closure Operators

An order-preserving poset map $f : P \to P$ is said to be a closure operator whenever $f \circ f = f$ and $f(p) \geq p$ for all $p \in P$. If $f$ is a closure operator, then it induces a homotopy equivalence between $P$ and the image $f(P)$.  See [1, 22, 19, 18] for more details.

One can think of a closure operator as “pushing elements up” in the poset. From a privacy perspective, “pushing up” amounts to inference. Specifically, $(\phi_R \circ \psi_R)(\gamma) \setminus \gamma$ consists of all additional attributes that may be inferred from observing attributes $\gamma$, while $(\psi_R \circ \phi_R)(\sigma) \setminus \sigma$ consists of all additional individuals that may be inferred from observing individuals $\sigma$.

Comment:

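The formulas for $\phi_R$ and $\psi_R$ in Section 4.1 extend to the empty simplex $\emptyset$ and to the spaces $X$ and $Y$, suggesting “inferences from nothing”:  Observe that $\psi_R(\emptyset) = X$, so $(\phi_R \circ \psi_R)(\emptyset) = \phi_R(X)$ consists of all attributes that every individual in $X$ has. If $\phi_R(X) \neq \emptyset$, then the attributes $\phi_R(X)$ are inferable “for free” from $R$, that is, without making any observations. Similarly, $(\psi_R \circ \phi_R)(\emptyset) = \psi_R(Y)$ consists of all individuals who have every attribute in $Y$.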

Any poset $P$ defines a simplicial complex called the order complex $\Delta(P)$ of $P$. The simplices of $\Delta(P)$ are given by the finite chains in $P$. Suppose we start with a nonvoid simplicial complex $\Sigma$, construct its face poset $\mathfrak{F}(\Sigma)$, and then construct the order complex $\Delta(\mathfrak{F}(\Sigma))$. The result is isomorphic to the first barycentric subdivision of $\Sigma$ [20, 22]. A convenient visualization of the face posets $\mathfrak{F}(\Phi_R)$ and $\mathfrak{F}(\Psi_R)$ therefore is to draw the first barycentric subdivisions of $\Phi_R$ and $\Psi_R$, respectively, as in Figure 4.

Figure 4: Order complexes of the face posets of the complexes and shown in Figure 3.

Viewed in the order complexes, functions $\phi_R$ and $\psi_R$ are easy to visualize. They are fully determined by their actions on vertices of the order complexes, as shown in Table 1. (Bear in mind that each element of $\mathfrak{F}(\Phi_R)$ represents a simplex in $\Phi_R$ but is a vertex in $\Delta(\mathfrak{F}(\Phi_R))$. Similarly, each element of $\mathfrak{F}(\Psi_R)$ represents a simplex in $\Psi_R$ but is a vertex in $\Delta(\mathfrak{F}(\Psi_R))$.)

Table 1: The maps and , and their compositions, for relation of Figure 3.

Using Table 1, one can again see how privacy loss might occur via .

For instance, the map $\phi_R \circ \psi_R$ gives rise to the closure (i.e., a “pushing up”)

$$(\phi_R \circ \psi_R)(\{a\}) = \{a, b\},$$

telling us how to infer unobserved attribute b from observed attribute a (in the health study example of Section 3.1, Alice could infer that Bob has_cancer from knowing that he smokes).

Similarly, for the map ,

leading to association inference (in the payroll example from Section 3.2, the payroll clerk could infer Bob and Mary’s exposure to valuable information after learning of Julie’s work on an important project).

Figure 5 indicates the homotopy deformations produced by the maps and , while Figure 6 shows the resulting image of each face poset.

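Figure 5: Closure operators $\phi_R \circ \psi_R$ and $\psi_R \circ \phi_R$ produce homotopy deformations, indicated by directed edges. In $\Psi_R$, $\{a\}$ closes up to $\{a, b\}$. In $\Phi_R$, most of the subsets of $\{2, 3, 4\}$ close up to $\{2, 3, 4\}$. The exception is subset $\{2\}$, which does not move.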
Figure 6: Images of the closure operators of Figure 5.

Observe that these two images are isomorphic. Matching up corresponding elements produces the poset of Figure 1.

Summary:

A relation $R$ produces two simplicial complexes, $\Psi_R$ and $\Phi_R$, one modeling attributes shared by individuals, the other modeling individuals with common attributes. The complexes are related by two maps, $\phi_R$ and $\psi_R$, that are homotopy inverses. The compositions of these maps describe the attribute and association inferences possible via $R$, leveraging background information someone may have. These inferences are summarized by a poset $P_R$ that pairs sets of individuals with sets of attributes. We may describe $P_R$ as follows:

Definition 3 (Doubly-Labeled Poset).

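Let $R$ be a relation with nonvoid Dowker complexes.

The doubly-labeled poset $P_R$ associated with $R$ consists of all ordered pairs of sets $(\sigma, \gamma)$ such that $\emptyset \neq \sigma \subseteq X$, $\emptyset \neq \gamma \subseteq Y$, $\phi_R(\sigma) = \gamma$, and $\psi_R(\gamma) = \sigma$.

The partial order on $P_R$ is defined by: $(\sigma_1, \gamma_1) \leq (\sigma_2, \gamma_2)$ if and only if

$$\sigma_1 \subseteq \sigma_2$$

(and/or, equivalently, $\gamma_1 \supseteq \gamma_2$).

See Appendix A.4 for some special cases.

(This definition agrees with the intuition that $P_R$ is both the image $(\psi_R \circ \phi_R)(\mathfrak{F}(\Phi_R))$ and the image $(\phi_R \circ \psi_R)(\mathfrak{F}(\Psi_R))$, by Appendix B.)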
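The following sketch enumerates a doubly-labeled poset by closing every nonempty set of individuals, again on a hypothetical toy relation. It assumes that elements are pairs of nonempty sets that the two Galois maps send to each other, and that the order is inclusion on the individuals component; the relation entries and all function names are illustrative assumptions.

```python
from itertools import combinations

# Hypothetical toy relation: individual -> row of attributes.
R = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c"}, 4: {"c"}}
ATTRIBUTES = frozenset().union(*R.values())

def phi(sigma):
    """Attributes shared by every individual in sigma."""
    shared = set(ATTRIBUTES)
    for x in sigma:
        shared &= R[x]
    return frozenset(shared)

def psi(gamma):
    """Individuals who have every attribute in gamma."""
    return frozenset(x for x, row in R.items() if set(gamma) <= row)

def doubly_labeled_poset():
    """Pairs (sigma, gamma), both nonempty, with phi(sigma) == gamma
    and psi(gamma) == sigma, found by closing each set of individuals."""
    elements = set()
    people = sorted(R)
    for k in range(1, len(people) + 1):
        for subset in combinations(people, k):
            gamma = phi(subset)
            if gamma:  # skip sets of individuals with no shared attribute
                elements.add((psi(gamma), gamma))
    return elements

def leq(p, q):
    """Assumed order: inclusion on the individuals component."""
    return p[0] <= q[0]

for sigma, gamma in sorted(doubly_labeled_poset(), key=lambda e: sorted(e[0])):
    print(sorted(sigma), sorted(gamma))
```

Comparing pairs by the individuals component and comparing them by reverse inclusion of the attributes component give the same order, which can be checked directly on the enumerated elements.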

4.3 Attribute and Association Privacy

Here are formal definitions for the intuition developed via the previous examples:

Definition 4 (Attribute Privacy).

Let $R$ be a relation with nonvoid Dowker complexes.

We say that $R$ preserves attribute privacy precisely when
       $\phi_R \circ \psi_R$ is the identity operator on the poset $\mathfrak{F}(\Psi_R)$.

Definition 5 (Association Privacy).

Let $R$ be a relation with nonvoid Dowker complexes.

We say that $R$ preserves association privacy precisely when
       $\psi_R \circ \phi_R$ is the identity operator on the poset $\mathfrak{F}(\Phi_R)$.

Comment:

For notational simplicity, we frequently say simply that

$\phi_R \circ \psi_R$ is the identity on $\Psi_R$ and/or that $\psi_R \circ \phi_R$ is the identity on $\Phi_R$.
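A minimal sketch of both privacy tests, checking whether each closure operator fixes every nonempty simplex of the corresponding Dowker complex. The three relations are hypothetical: `R1` is a toy relation, `R2` adds one extra entry to it (in the spirit of the disinformation example), and `D` has a blank diagonal with all other entries present. All names are illustrative assumptions.

```python
from itertools import combinations

def phi(rel, sigma):
    """Attributes shared by every individual in sigma."""
    shared = set().union(*rel.values())
    for x in sigma:
        shared &= rel[x]
    return frozenset(shared)

def psi(rel, gamma):
    """Individuals who have every attribute in gamma."""
    return frozenset(x for x, row in rel.items() if set(gamma) <= row)

def nonempty_subsets(s):
    items = sorted(s)
    return (frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k))

def preserves_attribute_privacy(rel):
    """True when phi after psi fixes every nonempty attribute simplex."""
    return all(phi(rel, psi(rel, g)) == g
               for row in rel.values() for g in nonempty_subsets(row))

def preserves_association_privacy(rel):
    """True when psi after phi fixes every nonempty simplex of individuals."""
    attrs = set().union(*rel.values())
    cols = [psi(rel, {y}) for y in attrs]  # columns of the relation
    return all(psi(rel, phi(rel, s)) == s
               for col in cols for s in nonempty_subsets(col))

# Hypothetical relations: individual -> row of attributes.
R1 = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c"}, 4: {"c"}}
R2 = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"a", "c"}, 4: {"c"}}
D = {1: {"b", "c"}, 2: {"a", "c"}, 3: {"a", "b"}}

for name, rel in [("R1", R1), ("R2", R2), ("D", D)]:
    print(name, preserves_attribute_privacy(rel),
          preserves_association_privacy(rel))
```

Enumerating simplices as subsets of rows and columns is exponential in row size, which is acceptable for small illustrative relations but not intended as an efficient procedure.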

4.4 Disinformation Example Re-Revisited

Figure 7: The Dowker complexes, as well as the order complexes of their face posets, for the relation of Figure 2 on page 2. The closure operator is the identity on . The closure operator on closes many (but not all) subsets of up to , as indicated by the directed arrows. The result is a poset isomorphic to the poset of Figure 2, drawn again slightly differently in Figure 8. Also, . Thus relation preserves attribute privacy but not association privacy.
Figure 8: A flattened view of the doubly-labeled poset from Figure 2. Combined with Figure 7, this perspective shows how arises as the images of and under the closure operators and , respectively. (The vertices drawn as bigger dots in the current figure were higher up in the poset of Figure 2 than those drawn as smaller dots.)

Recall the relation $R_2$ of Figure 2, which is relation $R_1$ of Figure 1 but with an added entry of disinformation. Figure 7 displays the resulting Dowker complexes and the actions of the closure operators. Figure 8 flattens out the poset of Figure 2, so one sees its triangle structure and how it is the image of the Dowker complexes under the closure operators for $R_2$.

5 The Face Shape of Privacy

Figure 9: Relations and of Section 3, along with their attribute complexes and .

5.1 Free Faces

Figure 9 recapitulates relations $R_1$ and $R_2$ from the previous two sections, along with their Dowker attribute complexes, $\Psi_{R_1}$ and $\Psi_{R_2}$, respectively. Recall that in $R_1$ one could make the inference $a \Rightarrow b$, but no such inference was possible in $R_2$.

The structure of $\Psi_{R_1}$ suggests that the inference $a \Rightarrow b$ might be possible in $R_1$. In contrast, the structure of $\Psi_{R_2}$ makes clear that such an inference is impossible in $R_2$. In particular, observe how vertex a has only one incident edge in $\Psi_{R_1}$ but has two incident edges in $\Psi_{R_2}$. The fact that there are two edges in $\Psi_{R_2}$, with those edges being maximal simplices, means, intuitively, that vertex a is being “pulled” in two different inference directions, so one cannot conclude anything additional from attribute a. In contrast, in $\Psi_{R_1}$, vertex a is being “pulled” only toward b, so it is plausible that attribute a might imply attribute b.

The underlying geometry is that of a free face. A simplex $\sigma$ of a simplicial complex $\Sigma$ is said to be a free face of $\Sigma$ if it is a proper subset of exactly one maximal simplex of $\Sigma$. That is true for $\{a\}$ in $\Psi_{R_1}$ but not for $\{a\}$ in $\Psi_{R_2}$.

Of course, vertex $\{c\}$ also forms a free face in $\Psi_{R_1}$, yet one cannot make any inferences upon observing just attribute c. What is going on? The difference is that c is also an attribute of individuals in $R_1$ who have only c as an attribute (specifically, individuals #3 and #4). Even though $\{c\}$ is technically a free face of $\Psi_{R_1}$, it is not really free to move under the closure operator $\phi_{R_1} \circ \psi_{R_1}$, whereas $\{a\}$ is.

Observe that individuals #2, #3, and #4 all have attribute c, but only individual #2 has additional attributes. This means that individuals #3 and #4 cannot ever be identified uniquely in the context of relation $R_1$; they have effectively “camouflaged” themselves with individual #2, as far as relation $R_1$ is concerned. If one disallows or disregards such camouflage, then the idea of a free face and privacy loss are equivalent. The following definition is useful:

Definition 6 (Unique Identifiability).

Let $R$ be a relation on $X \times Y$ and suppose $x \in X$.
We say that $x$ is  uniquely identifiable via relation $R$  when $\psi_R(Y_x) = \{x\}$.

Suppose $R$ is a relation. Appendix C.3 proves that if $\Psi_R$ has no free faces, then $R$ preserves attribute privacy. For the converse, Appendix C.3 further proves that if $R$ preserves attribute privacy and if every individual is uniquely identifiable, then $\Psi_R$ has no free faces. (Dual statements hold for association privacy.)
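The free-face and identifiability notions above can be sketched as follows, on two hypothetical toy relations (the second adds one entry to the first). Unique identifiability is assumed here to mean that the only individual having all of a given individual's attributes is that individual; the relation entries and function names are illustrative assumptions.

```python
from itertools import combinations

def nonempty_subsets(s):
    items = sorted(s)
    return (frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k))

def maximal_simplices(rel):
    """Rows of the relation that are not properly contained in another row."""
    rows = {frozenset(r) for r in rel.values() if r}
    return {r for r in rows if not any(r < other for other in rows)}

def free_faces(rel):
    """Attribute simplices that are proper subsets of exactly one maximal simplex."""
    tops = maximal_simplices(rel)
    faces = {f for t in tops for f in nonempty_subsets(t)}
    return {f for f in faces if sum(f < t for t in tops) == 1}

def uniquely_identifiable(rel, x):
    """Assumed test: no other individual has all of x's attributes."""
    holders = {z for z, row in rel.items() if rel[x] <= row}
    return holders == {x}

# Hypothetical relations: individual -> row of attributes.
R1 = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c"}, 4: {"c"}}
R2 = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"a", "c"}, 4: {"c"}}

print(sorted(sorted(f) for f in free_faces(R1)))
print(sorted(sorted(f) for f in free_faces(R2)))
print([x for x in R1 if uniquely_identifiable(R1, x)])
```

In the first relation both a and c are free faces, yet only a carries an inference; the individuals who have only c are not uniquely identifiable, which is the camouflage effect described above.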

5.2 Privacy versus Identifiability

Section 5.1 hinted at the difference between privacy and identifiability. In relation below (“I” for “individuality” or “identity”), every individual has exactly one attribute and that attribute uniquely identifies the individual. Relation preserves privacy fully (assuming ). It is impossible to make any attribute inferences. If Bob reveals that he has attribute , then Alice cannot infer any additional attributes for Bob. She now knows that Bob is individual but cannot infer any additional attributes. He has himself revealed everything about himself that there is to know, as far as relation is concerned.

In contrast, all individuals in relation (for “conformism” or “confusion”) have exactly the same set of attributes. As a result, there is no privacy: one can predict all the attributes of any individual in the relation without making any observations. On the other hand, no individual is uniquely identifiable (assuming ).

Homogeneity:

Relation $C$ exhibits a form of homogeneity often sought by anonymization or other privacy techniques. As we have suggested before, the utility of relation $C$ is essentially zero, unless one makes the entries stochastic, so that some utility is encoded in the distribution.

The discussion of free faces in Section 5.1 suggests an alternative approach to homogeneity: one may preserve privacy and retain utility by choosing the geometry of the relation appropriately, for instance, so the space exhibits sphere-like homogeneity. There will be considerable discussion of the importance of spheres in the rest of the report.

5.3 Spheres and Privacy

The attribute complex $\Psi_{R_2}$ of Figure 9 is equal to a boundary complex, namely the boundary of the full simplex consisting of the attributes $\{a, b, c\}$. We will denote boundary complexes by $\partial(V)$, with $V$ some nonempty set. The simplices of $\partial(V)$ are all proper subsets of $V$. Boundary complexes are homotopic to spheres, specifically $\partial(V) \simeq S^{n-2}$, with $n = |V|$. For $\Psi_{R_2}$ of Figure 9, we have that $\Psi_{R_2} = \partial(\{a, b, c\}) \simeq S^1$. (In English: The Dowker attribute complex is the boundary of a triangle, so homotopic to a circle.)

More generally, if for some relation on , , then cannot have any free faces and so preserves attribute privacy.
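As a small illustration of the boundary condition, the sketch below tests whether the nonempty simplices of an attribute complex are exactly the nonempty proper subsets of the full attribute set. Both relations are hypothetical toy examples and the function names are illustrative assumptions.

```python
from itertools import combinations

def nonempty_subsets(s):
    items = sorted(s)
    return (frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k))

def attribute_simplices(rel):
    """Nonempty simplices of the attribute complex: subsets of rows."""
    return {f for row in rel.values() for f in nonempty_subsets(row)}

def is_boundary_complex(rel):
    """True when the attribute complex is the boundary of the full
    simplex on all attributes (every proper subset, but not the whole set)."""
    attrs = frozenset().union(*rel.values())
    proper = set(nonempty_subsets(attrs)) - {attrs}
    return attribute_simplices(rel) == proper

# Hypothetical relations: individual -> row of attributes.
R1 = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c"}, 4: {"c"}}
R2 = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"a", "c"}, 4: {"c"}}

for name, rel in [("R1", R1), ("R2", R2)]:
    n = len(set().union(*rel.values()))
    print(name, is_boundary_complex(rel), "sphere dimension if boundary:", n - 2)
```

With three attributes, a boundary complex is a hollow triangle, so the reported sphere dimension is 1, matching the circle described above.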

Privacy and Utility:

An important observation is that boundary complexes exhibit homogeneity but still permit identifiability. If $\Psi_R = \partial(Y)$, with $|Y| = n$, and if no individual’s attributes are a subset of another’s attributes, then one can and needs to specify $n - 1$ attributes in order to identify an individual. The boundary structure ensures that one cannot infer any attributes by specifying fewer than $n - 1$ attributes, yet retains the ability to identify every individual.

Appendix J.1 gives an example of a contractible space that preserves attribute privacy. Observe, however, that the number of attributes needed to identify an individual in that example is considerably less than the total number of attributes in the space. For a boundary complex, it is just one less.

Preserving Attribute and Association Privacy:

A consequence of these observations is that if one wishes to preserve both attribute and association privacy with a connected relation, then one requires both Dowker complexes to look like spheres. More specifically, either both Dowker complexes are linear cycles of the same length or both are boundary complexes of the same dimension. In the latter case, the relation is isomorphic to a relation of the following form, in which the diagonal is blank but all other entries are present:

See Appendix E.3 for further details.

5.4 A Spherical Non-Boundary Relation that Preserves Attribute Privacy

Figure 10: A relation and its Dowker complexes and , each homotopic to the two-dimensional sphere . (One may view as two party hats glued together. One may view as a cylinder with a triangular cross-section and endcaps. However, the quadrilaterals drawn for the cylinder portion of are simply flattened sketches of what are actually solid tetrahedra.)

Consider relation as in Figure 10. Relation preserves attribute privacy, since has no free faces. The relation does not preserve association privacy. In particular, the quadrilaterals drawn for in the figure are actually tetrahedra. This means that the diagonals of the quadrilaterals are free faces. For instance, one would expect to infer individuals #1 and #6 as additional unobserved associates if one observes individuals #3 and #4. Indeed, computing using the closure operator , we see that:

Relation has another interesting feature. Even though is not itself a boundary complex, it is the simplicial join (see page A.1) of two boundary complexes:

In fact, we can think of as and as , with the restriction of to the attributes and the restriction of to the attributes . This join structure of means that we can view every individual in as being described by two independent attribute spaces. The attribute space acts like a standard bit; every individual has exactly one of these two attributes. In contrast, the attribute space is an “any 2 of 3” type of descriptor. Every individual has exactly two of these three attributes.

Figure 11 shows the relations and along with their Dowker attribute complexes.

Figure 11: Relation of Figure 10 decomposes into two disjoint relations and such that , with the boundary complex of a triangle and two isolated points. This means every individual in has attributes that act like two independent coordinates: an “any 2 of 3” component and a bit.

6 Conditional Relations as Simplicial Links

The decomposition of Figures 10 and 11 is reminiscent of stochastic independence expressed as multiplication of probabilities. Similarly, there is a combinatorial analogue to the notion of a conditional probability distribution. It appears as the link  of a simplex in a simplicial complex.

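Given a relation $R$, suppose we have observed attributes $\gamma$ for some unknown individual. The remaining possible combinations of attributes we might yet observe are described by the simplicial complex $\mathrm{Lk}(\Psi_R, \gamma) = \{\, \tau \in \Psi_R \mid \tau \cap \gamma = \emptyset \text{ and } \tau \cup \gamma \in \Psi_R \,\}$. Interpretation: $\tau \cap \gamma = \emptyset$ means that $\tau$ consists of as yet unobserved attributes, while $\tau \cup \gamma \in \Psi_R$ means that there is some individual who has the attributes $\tau$ in addition to the attributes $\gamma$ that we have already observed.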

Figure 12: Relation describes the conditional relation resulting from of Figure 10 upon observing attribute d. Note that .

For instance, after observing attribute d in relation of Figure 10, we may conclude that we are observing one of the individuals in and that the remaining attributes we might yet observe are any two attributes drawn from . We can express these conclusions as yet another relation, namely the relation of Figure 12. Relation describes exactly which individuals could give rise to which attributes, consistent with the prior observation of attribute d. Thus plays a role much like a probability distribution, while plays the role of a conditional distribution. For another example, suppose we have observed attribute b in . Then the resulting conditional relation is as in Figure 13.

Figure 13: Relation describes the conditional relation resulting from of Figure 10 upon observing attribute b. Here . Observe that the attribute space for now factors into two independent bits: constitutes one bit, the other. This factoring is conditional on having observed b.

The formal constructions of conditional relations appear below.  See also Appendix C.1.

Notation:

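A symbol of the form $R|_{A \times B}$ means “restrict $R$ to $A \times B$”. For instance, if $R$ is a relation on $X \times Y$, and if $A \subseteq X$ and $B \subseteq Y$, then  $R|_{A \times B} = R \cap (A \times B)$.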

Definition 7 (Conditional Attribute Relations).

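Let $R$ be a nonvoid relation on $X \times Y$ and suppose $\gamma \subseteq Y$.  The following relation $Q$ models $\mathrm{Lk}(\Psi_R, \gamma)$:

$$Q = R|_{\sigma \times (Y \setminus \gamma)}, \quad \text{with } \sigma = \psi_R(\gamma).$$

The Dowker complexes are defined in the standard way, except for this special case:

If $\sigma \neq \emptyset$ and $\gamma = Y$, then we let $\Phi_Q$ and $\Psi_Q$ be instances of the empty complex $\{\emptyset\}$.

Observe:

$\Psi_Q = \mathrm{Lk}(\Psi_R, \gamma)$ (a proof appears in Appendix C.1).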

Comment:

If $\gamma \notin \Psi_R$, then $\psi_R(\gamma) = \emptyset$ and $Q$ is void, and so $\Psi_Q$ is an instance of the void complex, consistent with the standard definition of $\mathrm{Lk}(\Psi_R, \gamma)$ being void in this situation.  (See Appendix A.1 for the definitions of void simplicial complex  and empty simplicial complex, and Appendix A.4 for the definition of void relation.)
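A minimal sketch of the link construction for attributes, on a hypothetical toy relation. It assumes the conditional relation is obtained by keeping only the individuals who have every observed attribute and deleting the observed attributes from their rows, then checks that the resulting attribute complex agrees with the link computed directly. Only nonempty simplices are compared, and all names and entries are illustrative assumptions.

```python
from itertools import combinations

def nonempty_subsets(s):
    items = sorted(s)
    return (frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k))

def attribute_simplices(rel):
    """Nonempty simplices of the attribute complex: subsets of rows."""
    return {f for row in rel.values() for f in nonempty_subsets(row)}

def link(simplices, gamma):
    """Nonempty simplices disjoint from gamma whose union with gamma
    is again a simplex."""
    gamma = frozenset(gamma)
    return {t for t in simplices
            if not (t & gamma) and (t | gamma) in simplices}

def conditional_relation(rel, gamma):
    """Assumed form of the conditional relation: individuals having all
    of gamma, restricted to the attributes outside gamma."""
    gamma = set(gamma)
    return {x: row - gamma for x, row in rel.items() if gamma <= row}

# Hypothetical relation: individual -> row of attributes.
R2 = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"a", "c"}, 4: {"c"}}

Q = conditional_relation(R2, {"c"})
print(Q)
print(link(attribute_simplices(R2), {"c"}) == attribute_simplices(Q))
```

Working with the conditional relation rather than the bare link keeps track of which individuals remain possible after the observation, in the same way a conditional distribution keeps track of remaining outcomes.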

There is a dual construction for links of individuals in the Dowker complex modeling associations:

Definition 8 (Conditional Association Relations).

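Let $R$ be a nonvoid relation on $X \times Y$ and suppose $\sigma \subseteq X$.  The following relation $Q$ models $\mathrm{Lk}(\Phi_R, \sigma)$:

$$Q = R|_{(X \setminus \sigma) \times \gamma}, \quad \text{with } \gamma = \phi_R(\sigma).$$

The Dowker complexes are defined in the standard way, except for this special case:

If $\gamma \neq \emptyset$ and $\sigma = X$, then we let $\Phi_Q$ and $\Psi_Q$ be instances of the empty complex $\{\emptyset\}$.

Observe:

$\Phi_Q = \mathrm{Lk}(\Phi_R, \sigma)$.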

As we will see in Section 7, the complex $\mathrm{Lk}(\Phi_R, x)$ is useful for characterizing individual $x$’s attribute privacy. If that seems surprising, observe that $\mathrm{Lk}(\Phi_R, x)$ describes other individuals in $R$ who share attributes with $x$, with simplices modeling the extent of commonalities. These commonalities, or lack thereof, determine whether in $Q$, and thus back in $R$, there are attributes of $x$ that are “free to move” under the closure operators.

7 Privacy Characterization via Boundary Complexes

Figure 14: With as in Figure 10, relation describes the conditional relation corresponding to . Also shown are the Dowker complexes of . By design, . Observe that is the boundary complex , with being all of individual #3’s attributes in relation . That boundary condition characterizes attribute privacy for an identifiable individual.  Here, it means that individual #3 has full attribute privacy.

We observed earlier that relation of Figure 10 preserves attribute privacy. We came to that conclusion after observing that has no free faces. In fact, one can focus on the privacy of any identifiable individual rather than look at the whole relation. Let us pick one such individual, say #3, and look at the conditional relation that models the link , as shown in Figure 14.  (Observe that individual #3 is indeed uniquely identifiable via .)

Individual #3 has attributes in . The attribute complex for is the boundary complex on exactly this set. Interpretation: for any nonempty proper subset of individual #3’s attributes, some  other  individual in has at least those attributes but not all of individual #3’s attributes. Consequently, there is a different such individual for each proper subset of that is missing exactly one of #3’s attributes. That diversity of individuals ensures individual #3’s attribute privacy.

The previous example suggests the following characterization: An identifiable individual has full attribute privacy precisely when the attribute complex of the individual’s link is the boundary complex of the individual’s attributes.

Observe that this characterization is local to the individual; it does not depend on other individuals having privacy.  We now formalize this intuition.  Proofs appear in Appendix E.

First, a definition to make precise the notion of individual privacy:

Definition 9 (Individual Privacy).

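Let $R$ be a relation on $X \times Y$ and suppose $x \in X$.

We say that $R$ preserves attribute privacy for $x$ whenever $(\phi_R \circ \psi_R)(\gamma) = \gamma$ for all $\gamma \subseteq Y_x$.

Informally, we may also say that  individual $x$ has full attribute privacy.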

Recall also Definitions 4 and 6, formalizing the notions of (attribute) privacy preservation and unique identifiability, respectively. And recall the semantics of $P_R$, for instance from Definition 3.

Here is the characterization of individual attribute privacy formalized:

Theorem 10 (Individual Attribute Privacy).

Let be a relation on , with . Suppose is uniquely identifiable via .  Let be the relation modeling .
Then the following three conditions are equivalent:

  1. $R$ preserves attribute privacy for $x$.

  2. $\Phi_Q \simeq S^{k-2}$, with $k = |Y_x|$.

  3. $\Psi_Q = \partial(Y_x)$.
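The characterization stated in prose before the theorem (an identifiable individual has full attribute privacy precisely when the attribute complex of the individual's link is the boundary complex of the individual's attributes) can be explored with the sketch below. It compares a direct closure test with a boundary test on the link, assuming the link's relation consists of the other individuals restricted to the chosen individual's attributes. The relations and all function names are hypothetical, and only nonempty simplices are compared.

```python
from itertools import combinations

def nonempty_subsets(s):
    items = sorted(s)
    return (frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k))

def phi(rel, sigma):
    """Attributes shared by every individual in sigma."""
    shared = set().union(*rel.values())
    for x in sigma:
        shared &= rel[x]
    return frozenset(shared)

def psi(rel, gamma):
    """Individuals who have every attribute in gamma."""
    return frozenset(x for x, row in rel.items() if set(gamma) <= row)

def has_attribute_privacy(rel, x):
    """Direct test: the closure fixes every nonempty subset of x's attributes."""
    return all(phi(rel, psi(rel, g)) == g for g in nonempty_subsets(rel[x]))

def link_is_boundary(rel, x):
    """Boundary test on x's link (assumed form: the other individuals,
    restricted to x's attributes)."""
    own = frozenset(rel[x])
    others = {z: row & own for z, row in rel.items() if z != x}
    simplices = {f for row in others.values() for f in nonempty_subsets(row)}
    proper = set(nonempty_subsets(own)) - {own}
    return simplices == proper

# Hypothetical relations: individual -> row of attributes.
R1 = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c"}, 4: {"c"}}
R2 = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"a", "c"}, 4: {"c"}}

for name, rel in [("R1", R1), ("R2", R2)]:
    print(name, has_attribute_privacy(rel, 1), link_is_boundary(rel, 1))
```

In both toy relations individual #1 is uniquely identifiable, and the two tests agree: they fail together in the first relation and succeed together in the second. The boundary test only inspects individuals who share attributes with #1, which reflects the local nature of the characterization.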

The previous theorem generalizes to sets of individuals for sets that are “stable” under the closure operators, i.e., that appear as the “set of individuals component” in an element of :

Theorem 11 (Group Attribute Privacy).

Let be a relation on .
Suppose , with .  Let be the relation modeling .
Then the following three conditions are equivalent:

  1. , for every subset of .

  2. , with .

  3. .

The following lemma relates interpretation and inference in a link to the encompassing relation:

Lemma 12 (Interpreting Local Operators).

Let be a relation on .

Suppose , with .

Let be the relation on that models and suppose .

Then, for every :

  1. If , then .

  2. If , then .

    Moreover, in this case:

    For ,  .

    If ,  then .

The lemma says that observations of attributes consistent in have as interpretation more individuals in than just the individuals . However, if ever those observations become inconsistent in , then one has identified in . Here “inconsistent in ” means that the observed attributes are legitimate attributes for but do not constitute a simplex of . (Note: Such observed attributes necessarily constitute a simplex of since they are a subset of ).

Moreover, attribute inferences are identical in and for nonempty simplices of .

8 The Meaning of Holes in Relations

We have seen how spheres characterize privacy. More generally, when working with topological spaces, holes are significant. One wonders what topological holes mean for relations.

  • Some holes arise as a consequence of exclusion between attributes, as we saw in the decomposition of Figures 10 and 11.

    Sticking with binary exclusions, suppose a group of individuals are described by bits. One can model those individuals via a relation containing binary attributes (two such attributes per bit, one for each possible bit value). Every individual has exactly