A fundamental idea behind most forms of data-driven research and machine learning is the concept of generalization: the ability to infer properties of a data distribution by working only with a sample from that distribution. One typical approach is to invoke a concentration bound to ensure that, for a sufficiently large sample size, the evaluation of the function on the sample set will yield a result that is close to its value on the underlying distribution, with high probability. Intuitively, these concentration arguments ensure that, for any given function, most sample sets are good “representatives” of the distribution. Invoking a union bound, such a guarantee easily extends to the evaluation of multiple functions on the same sample set.
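As one concrete instance of the concentration-plus-union-bound argument, the following worked bound assumes (beyond what the text states) that each function is the average of a $[0,1]$-valued function over $n$ i.i.d. sample elements, and it uses Hoeffding's inequality as the concentration bound. The symbols $q_j$, $k$, $\epsilon$ and $\delta$ are introduced here for illustration only.

```latex
% Assumption: q_j(s) = (1/n) \sum_{i=1}^n q_j(x_i), q_j(x) \in [0,1], x_i i.i.d. from D.
% Hoeffding's inequality for one fixed q_j:
\Pr_{s \sim D^n}\left[ \left| q_j(s) - \mathbb{E}_{x \sim D}[q_j(x)] \right| > \epsilon \right]
  \le 2 e^{-2 n \epsilon^2}

% Union bound over k functions fixed in advance:
\Pr_{s \sim D^n}\left[ \exists j \le k : \left| q_j(s) - \mathbb{E}_{x \sim D}[q_j(x)] \right| > \epsilon \right]
  \le 2 k e^{-2 n \epsilon^2} \le \delta
  \quad \text{whenever} \quad n \ge \frac{\ln(2k/\delta)}{2 \epsilon^2}
```

The union bound step is valid only because the $k$ functions are fixed before the sample is drawn.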
Of course, such guarantees hold only if the functions to be evaluated were chosen independently of the sample set. In recent years, grave concern has erupted in many data-driven fields that adaptive selection of computations is eroding the statistical validity of scientific findings [Ioa05, GL14]. Adaptivity is not an evil to be avoided; it constitutes a natural part of the scientific process, wherein previous findings are used to develop and refine future hypotheses. However, unchecked adaptivity can (and often does, as demonstrated by, e.g., [DFH15b] and [RZ16]) lead one to evaluate overfitting functions: ones that return very different values on the sample set than on the distribution.
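The effect of unchecked adaptivity can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not taken from the text: features and labels are independent fair coin flips (so no classifier can beat 50% on the distribution), and the sizes `n`, `d` and the selection threshold are arbitrary illustrative choices. The second round of the analysis is chosen adaptively, based on the answers to the first round.

```python
import random

def make_data(rng, n, d):
    """Features and labels are independent fair coin flips in {-1, +1}."""
    xs = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(n)]
    ys = [rng.choice((-1, 1)) for _ in range(n)]
    return xs, ys

def accuracy(xs, ys, signs):
    """Fraction of points whose label matches a signed vote over the selected features."""
    correct = 0
    for x, y in zip(xs, ys):
        vote = sum(sgn * x[j] for j, sgn in signs.items())
        pred = 1 if vote >= 0 else -1
        correct += (pred == y)
    return correct / len(ys)

rng = random.Random(0)
n, d = 200, 500
xs, ys = make_data(rng, n, d)

# Round 1: d non-adaptive queries (empirical feature/label correlations).
corr = [sum(x[j] * y for x, y in zip(xs, ys)) / n for j in range(d)]

# Round 2 (adaptive): keep only the features that looked correlated on this sample.
threshold = 1 / n ** 0.5
signs = {j: (1 if c > 0 else -1) for j, c in enumerate(corr) if abs(c) > threshold}

sample_acc = accuracy(xs, ys, signs)            # evaluated on the same sample
fresh_xs, fresh_ys = make_data(rng, 1000, d)
fresh_acc = accuracy(fresh_xs, fresh_ys, signs)  # evaluated on fresh data
print(f"selected={len(signs)} sample_acc={sample_acc:.2f} fresh_acc={fresh_acc:.2f}")
```

The classifier looks accurate on the sample it was selected on, while on fresh data it is no better than chance, which is exactly the gap between sample value and distribution value that the text calls overfitting.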
Traditional generalization guarantees do not necessarily guard against adaptivity; while generalization ensures that the response to a query on a sample set will be close to that of the same query on the distribution, it does not rule out the possibility that the probability to get a specific response will be dramatically affected by the contents of the sample set. In the extreme, a generalizing computation could encode the whole sample set in the low-order bits of the output, while maintaining high accuracy with respect to the underlying distribution. Subsequent adaptive queries could then, by post-processing the computation’s output, arbitrarily overfit to the sample set.
In recent years, an exciting line of work, starting with Dwork et al. [DFH15b], has formalized this problem of adaptive data analysis and introduced new techniques to ensure guarantees of generalization in the face of an adaptively-chosen sequence of computations (what we call here adaptive generalization). One great insight of Dwork et al. and followup work was that techniques for ensuring the stability of computations (some of them originally conceived as privacy notions) can be powerful tools for providing adaptive generalization.
A number of papers have considered variants of stability notions, the relationships between them, and their properties, including generalization properties. Despite much progress in this space, one issue that has remained open is the limits of stability—how much can the stability notions be relaxed, and still imply generalization? It is this question that we address in this paper.
1.1 Our Contribution
We introduce a new notion of the stability of computations, which holds under post-processing (Theorem 2.4) and adaptive composition (Theorems 2.9 and 2.10), and show that the notion is both necessary (Theorem 3.13) and sufficient (Theorem 3.9) to ensure generalization in the face of adaptivity, for any computations that respond to bounded-sensitivity linear queries (see Definition 3.1) while providing accuracy with respect to the data sample set. This means that, up to a small caveat, our stability definition is equivalent to generalization, assuming sample accuracy, for bounded linear queries. (The caveat is that our lower bound (Theorem 3.13) requires one more query than our upper bound (Theorem 3.9).) Linear queries form the basis for many learning algorithms, such as those that rely on gradients or on the estimation of the average loss of a hypothesis.
In order to formulate our stability notion, we consider a prior distribution over the database elements and the posterior distribution over those elements conditioned on the output of a computation. In some sense, harmful outputs are those that induce large statistical distance between this prior and posterior (Definition 2.1). Our new notion of stability, Local Statistical Stability (Definition 2.2), intuitively, requires a computation to have only small probability of producing such a harmful output.
In Section 4, we directly prove that Differential Privacy, Max Information and Compression Schemes all imply Local Statistical Stability, which provides an alternative method to establish their generalization properties. We also provide a few separation examples between the various definitions.
1.2 Additional Related Work
Most countermeasures to overfitting fall into one of a few categories. A long line of work bases generalization guarantees on some form of bound on the complexity of the range of the mechanism, e.g., its VC dimension (see [SSBD14] for a textbook summary of these techniques). Other examples include Bounded Description Length [DFH15a], and compression schemes [LW86] (which additionally hold under post-processing and adaptive composition [DFH15a, CLN16]). Another line of work focuses on the algorithmic stability of the computation [BE02], which bounds the effects on the output of changing one element in the training set.
A different category of stability notions, which focus on the effect of a small change in the sample set on the probability distribution over the range of possible outputs, has recently emerged from the notion of Differential Privacy [DMNS06]. Work of [DFH15b] established that Differential Privacy, interpreted as a stability notion, ensures generalization; it is also known (see [DR14]) to be robust to adaptivity and to withstand post-processing. A number of subsequent works propose alternative stability notions that weaken the conditions of Differential Privacy in various ways while attempting to retain its desirable generalization properties. One example is Max Information [DFH15a], which shares the guarantees of Differential Privacy. A variety of other stability notions ([RRST16, RZ16, RRT16, BNS16, FS17, EGI19]), unlike Differential Privacy and Max Information, only imply generalization in expectation. [XR17, Ala17, BMN17] extend these guarantees to generalization in probability, under various restrictions.
[CLN16] introduce the notion of post-hoc generalization, which captures robustness to post-processing, but it was recently shown not to hold under composition [NSS18]. The challenges that the internal correlation of non-product distributions present for stability have been studied in the context of Inferential Privacy [GK16] and Typical Stability [BF16].
2 LS stability definition and properties
Let $\mathcal{X}$ be an arbitrary countable domain. Fixing some $n \in \mathbb{N}$, let $D_{\mathcal{X}^n}$ be some probability distribution defined over $\mathcal{X}^n$. (Throughout the paper, $\mathcal{X}^n$ can either denote the family of sequences of length $n$ or a multiset of size $n$; that is, the sample set $s$ can be treated as an ordered or unordered set.) Let $\mathcal{Q}, \mathcal{R}$ be some arbitrary countable sets which we will refer to as generators and responses respectively. Let a mechanism $M : \mathcal{X}^n \times \mathcal{Q} \to \mathcal{R}$ be a (possibly non-deterministic) function that, given a sample set $s \in \mathcal{X}^n$ and a generator $q \in \mathcal{Q}$, returns a response $r \in \mathcal{R}$. Intuitively, generators can be thought of as representing functions from $\mathcal{X}^n$ to $\mathcal{R}$ and the mechanism as providing an estimate of the value of those functions, but we do not restrict the definitions, for reasons which will become apparent once we formalize the adaptive process (Definition 2.5).
This setting involves two sources of randomness, the underlying distribution , and the conditional distribution —that is, the probability to get as the output of . These in turn induce a set of distributions (formalized in Definition A.1): the joint distribution over , the disjoint (independent) distribution over , the unconditional (marginal) distribution over , and the conditional distribution over .
Although the underlying distribution is defined on , it induces a natural probability distribution over as well, as follows. We define the sampling function which, given a sample set, returns one of its sample elements uniformly at random. Notice that can be thought of as a mechanism with one possible generator and with response range , which allows us to define , and as well. (The superscript notation of is missing from the definition, since there is only one possible “generator” in this case.) It is worth noting that in the case where is the product distribution of some distribution over , we get that . This in turn allows us to define a few key distributions, which form a connection between and (formalized in Definition A.2): the joint distribution over , the disjoint distribution over , the conditional distribution over , and the conditional distribution over . We use this notation to denote both the probability that a distribution places on a subset of its range and the probability placed on a single element of the range.
2.1 Local Statistical Stability
Before observing any output from the mechanism, an outside observer knowing but without other information about the sample set holds prior that sampling an element of would return a particular . Once an output of the mechanism is observed, however, the observer’s posterior becomes . The difference between these two distributions is what determines the resulting degradation in stability. This difference could be quantified using a variety of distance measures (a partial list can be found in Appendix E); here we introduce a particular one which we use to define our stability notion.
Definition 2.1 (Stability loss of a response).
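Given a distribution $D_{\mathcal{X}^n}$, a generator $q$, and a mechanism $M$, the stability loss $\ell(r)$ of a response $r$ with respect to $D_{\mathcal{X}^n}$ and $q$ is defined as the Statistical Distance (see definition in Appendix E) between the prior distribution over $\mathcal{X}$ and the posterior induced by $r$. That is,
$$\ell(r) := \sum_{x \in \mathcal{X}_+(r)} \left( D(x \mid r) - D(x) \right),$$
where $D(x)$ and $D(x \mid r)$ denote the prior and posterior probabilities of the element $x$, and $\mathcal{X}_+(r) := \{ x \in \mathcal{X} : D(x \mid r) > D(x) \}$ is the set of all sample elements which have a posterior probability (given $r$) higher than their prior. Notice this notation omits $D_{\mathcal{X}^n}$ and $q$, on which it depends. Similarly, we define the stability loss of a set of responses with respect to and as,
Given $\epsilon \ge 0$, a response will be called $\epsilon$-unstable with respect to $D_{\mathcal{X}^n}$ and $q$ if its loss is greater than $\epsilon$. The set of all $\epsilon$-unstable responses will be denoted $\mathcal{R}_\epsilon$.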
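The stability loss of a response can be computed exactly on a toy example. The following is a minimal sketch, assuming a tiny finite domain so that all distributions can be enumerated; the function names, the dictionary-based representation of distributions, and the two example mechanisms are hypothetical and chosen only for illustration. The prior over elements is obtained as the text describes: draw a sample set, then pick one of its elements uniformly at random.

```python
from collections import defaultdict
from itertools import product

def element_distribution(sample_dist):
    """Distribution over single elements: draw s, then pick one element of s uniformly."""
    out = defaultdict(float)
    for s, p_s in sample_dist.items():
        for x in s:
            out[x] += p_s / len(s)
    return dict(out)

def stability_losses(sample_dist, mechanism):
    """Return {r: (Pr[r], loss(r))}; mechanism(s) returns {response: probability}."""
    joint = defaultdict(float)                      # joint probability of (s, r)
    for s, p_s in sample_dist.items():
        for r, p_r in mechanism(s).items():
            joint[(s, r)] += p_s * p_r
    marginal = defaultdict(float)                   # marginal probability of r
    for (_, r), p in joint.items():
        marginal[r] += p
    prior = element_distribution(sample_dist)
    stats = {}
    for r, p_r in marginal.items():
        if p_r == 0.0:
            continue
        posterior_over_samples = {s: p / p_r for (s, r2), p in joint.items() if r2 == r}
        posterior = element_distribution(posterior_over_samples)
        # Statistical distance: sum of the positive parts of (posterior - prior).
        loss = sum(max(0.0, posterior.get(x, 0.0) - prior[x]) for x in prior)
        stats[r] = (p_r, loss)
    return stats

# Toy setting: two i.i.d. fair bits as the sample set.
uniform = {s: 0.25 for s in product((0, 1), repeat=2)}
exact_mean = lambda s: {sum(s) / len(s): 1.0}   # reveals the sample mean exactly
constant = lambda s: {"ok": 1.0}                # reveals nothing about the sample

mean_stats = stability_losses(uniform, exact_mean)
const_stats = stability_losses(uniform, constant)
print(mean_stats)
print(const_stats)
```

In this toy setting, responses that pin down the sample set (a mean of 0 or 1) carry a large loss, the uninformative mean of 0.5 carries none, and the constant mechanism has zero loss everywhere.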
We now introduce our notion of stability of a mechanism.
Definition 2.2 (Local Statistical Stability).
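Given $\epsilon, \delta \ge 0$, a distribution $D_{\mathcal{X}^n}$, and a generator $q$, a mechanism $M$ will be called $(\epsilon, \delta)$-Local-Statistically Stable with respect to $D_{\mathcal{X}^n}$ and $q$ (or LS Stable, or LSS, for short) if for any $\mathcal{R}' \subseteq \mathcal{R}$,
$$\sum_{r \in \mathcal{R}'} D(r) \left( \ell(r) - \epsilon \right) \le \delta,$$
where $D(r)$ is the marginal probability of the response $r$ and $\ell(r)$ is its stability loss (Definition 2.1).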
Notice that the maximal value of the left hand side is achieved for the subset . This stability definition can be extended to apply to a family of generators and/or a family of possible distributions. When there exists a family of generators and a family of distributions such that a mechanism is -LSS for all and for all , then will be called -LSS for . (This stability notion somewhat resembles Semantic Privacy as discussed by [KS14], though they use it to compare different posterior distributions.)
Intuitively, this can be thought of as placing a bound on the probability of observing an outcome whose stability loss exceeds . This claim is formalized in the next Lemma.
Given , a distribution , and a generator , if a mechanism is -LSS with respect to , then .
Assume by way of contradiction that ; then
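The LSS condition can be checked mechanically once the probability and loss of every response are known. The displayed inequality of Definition 2.2 did not survive in this copy, so the sketch below assumes it has the form "for every subset of responses, the sum of Pr[r] * (loss(r) - eps) is at most delta", which is consistent with the remark that the left hand side is maximized on the set of unstable responses. The function names and the example numbers are hypothetical.

```python
def lss_slack(response_stats, eps):
    """Largest value of sum_{r in R'} Pr[r] * (loss(r) - eps) over all subsets R'.

    The maximum is attained by keeping exactly the responses with loss(r) > eps,
    since every other response contributes a non-positive term.
    """
    return sum(p * (loss - eps) for p, loss in response_stats.values() if loss > eps)

def is_lss(response_stats, eps, delta):
    """Check the (assumed) (eps, delta)-LSS condition."""
    return lss_slack(response_stats, eps) <= delta

def unstable_mass(response_stats, eps):
    """Total probability of the eps-unstable responses."""
    return sum(p for p, loss in response_stats.values() if loss > eps)

# Hypothetical mechanism with three responses: {response: (probability, stability loss)}.
stats = {0.0: (0.25, 0.5), 0.5: (0.5, 0.0), 1.0: (0.25, 0.5)}

print(lss_slack(stats, 0.1))       # worst-case left hand side for eps = 0.1
print(is_lss(stats, 0.1, 0.25))    # stable for a generous delta
print(is_lss(stats, 0.1, 0.1))     # not stable for a tighter delta
print(unstable_mass(stats, 0.1))   # probability of observing an unstable response
```

Only the unstable responses need to be examined, so the check is linear in the number of responses rather than exponential in it.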
We now turn to prove two crucial properties of LSS: post-processing and adaptive composition.
Post-processing guarantees (in some contexts, known as data processing inequalities) ensure that the stability of a computation can only be increased by subsequent manipulations. This is a key desideratum for concepts used to ensure adaptivity-proof generalization, since otherwise an adaptive subsequent computation could potentially arbitrarily degrade the generalization guarantees.
Theorem 2.4 (LSS holds under Post-Processing).
Given , a distribution , and a generator , if a mechanism is -LSS with respect to and , then for any range and any arbitrary (possibly non-deterministic) function , we have that is also -LSS with respect to and . An analogous statement also holds for mechanisms that are LSS with respect to a family of generators and/or a family of distributions.
We start by defining a function such that . Using this function we get that,
(detailed proof can be found in Appendix B.1).
Combining the two we get that,
where (1) results from the two previous claims, (2) from the fact that we removed only negative terms and (3) from the LSS definition, which concludes the proof. ∎
In order to formally define adaptive learning and stability under adaptively chosen generators, we formalize the notion of an adversary who issues those generators.
Definition 2.5 (Adversary and Adaptive Mechanism).
An adversary over a family of generators is a particular type of generator which is a (possibly non-deterministic) function that receives a view—a finite sequence of responses—and outputs a generator. We denote by the family of all adversaries, and write and .
Illustrated below, the adaptive mechanism is a particular type of mechanism, which inputs an adversary as its generator and which returns a view as its range type. It is parameterized by a set of sub-mechanisms where , . Given a sample set and an adversary as input, the adaptive mechanism iterates times through the process where sends a generator to and receives its response to that generator on the sample set. The adaptive mechanism returns the resulting sequence of responses . Naturally, this requires to match such that ’s range can be ’s input, and vice versa.
(If the same mechanism appears more than once in , it can also be stateful, which means it retains an internal record consisting of internal randomness, the history of sample sets and generators it has been fed, and the responses it has produced; its behavior may be a function of this internal record. We omit this from the notation for simplicity, but do refer to this when relevant. A stateful mechanism will be defined as LSS if it is LSS given any reachable internal record. A pedantic treatment might consider the probability that a particular internal state could be reached, and only require LSS when accounting for these probabilities.)
(If is randomized, we add one more step at the beginning where randomly generates some bits , namely ’s “coin tosses.” In this case, and receives the coin tosses as an input as well. This addition turns into a deterministic function of for any , a fact that will be used multiple times throughout the paper. In this situation, the randomness of results both from the randomness of the coin tosses and from that of the sub-mechanisms.)
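Adaptive Mechanism
Input: a sample set and an adversary
Output: a view (the sequence of responses)
initialize the view as empty (or as the adversary's coin tosses, if the adversary is randomized)
for each iteration:
    the adversary maps the current view to a generator
    the current sub-mechanism returns a response to that generator on the sample set
    the response is appended to the view
return the view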
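The interaction described above can be sketched in a few lines. This is a minimal illustration, not the paper's formal object: it assumes a deterministic adversary, stateless sub-mechanisms, and generators represented as Python callables on the sample set. The names `adaptive_mechanism`, `exact` and `adversary` are hypothetical.

```python
def adaptive_mechanism(sample, adversary, mechanisms):
    """Run one adaptive interaction and return the view (tuple of responses).

    adversary:  maps the current view to the next generator.
    mechanisms: one sub-mechanism per iteration; each maps (sample, generator) to a response.
    """
    view = ()
    for mechanism in mechanisms:
        generator = adversary(view)             # chosen adaptively from past responses
        response = mechanism(sample, generator)
        view = view + (response,)
    return view

def exact(sample, generator):
    """A sub-mechanism that answers every generator exactly on the sample set."""
    return generator(sample)

def adversary(view):
    """First asks for the mean, then for the fraction of elements above the reported mean."""
    if not view:
        return lambda s: sum(s) / len(s)
    last = view[-1]
    return lambda s: sum(x > last for x in s) / len(s)

view = adaptive_mechanism([0.0, 1.0, 2.0, 3.0], adversary, [exact, exact])
print(view)
```

The second generator depends on the first response, which is the dependence on the sample set that the stability analysis has to control.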
Definition 2.6 (-LSS under adaptivity).
Given , a distribution , and an adversary , a sequence of mechanisms will be called -local-statistically stable under adaptive iterations with respect to and (or -LSS for short), if is -LSS with respect to and (in which case we will use to denote the set of unstable views). This definition can be extended to a family of adversaries and/or a family of possible distributions as well.
Adaptive composition is a key property of a stability notion, since it restricts the degradation of stability across multiple computations. A key observation is that the posterior is itself a distribution over and is a deterministic function of . Therefore, as long as each sub-mechanism is LSS with respect to any posterior that could have been induced by previous adaptive interaction, one can reason about the properties of the composition.
Definition 2.7 (View-induced posterior distributions).
A sequence of mechanisms , an adversary , and a view together induce a set of posterior distributions over , , and . For clarity we will denote these induced distributions by instead of .
As mentioned before, all the distributions we consider stem from two basic distributions; the underlying distribution and the conditional distribution . The posteriors of these distributions change once we see . is replaced by (actually, the rigorous notation should have been , but since and will be fixed throughout this analysis, we omit them for simplicity). Similarly, is replaced by
where denotes the first iterations of the adaptive mechanism. (If is stateful, the conditioning can result from any unknown state of which might affect its response to . If has no shared state with the previous sub-mechanisms (either because it is a different mechanism or because it is stateless), then the only effect has on the posterior on is by governing (which, as mentioned before, is a deterministic function of for the given ), in which case where the mechanism is .)
We next establish two important properties of the distributions over induced by and their relation to the posterior distributions.
Given a distribution , an adversary , and a sequence of mechanisms where , , for any we denote . In this case, using notation from Definition 2.7,
The proof can be found in Appendix B.2.
Bounding the stability loss of a view by the sum of the losses of its responses with respect to the sub-mechanisms provides a linear bound on the degradation of the LSS parameters. Adding a bound on the expectation of the loss of the sub-mechanisms allows us to also invoke Azuma’s inequality and prove a sub-linear bound.
Theorem 2.9 (LSS adaptively composes linearly).
Given a family of distributions over , a family of generators , and a sequence of mechanisms where , , we will denote , and for any , the set of all posterior distributions induced by any response of with non-zero probability with respect to and .
Given a sequence , if for all , is -LSS with respect to and , the sequence is --LSS with respect to and any adversary over .
One simple case is when , and is -LSS with respect to and , for all .
Proof of Theorem 2.9.
This theorem is a direct result of combining Lemma 2.8 with the triangle inequality over the posteriors created at any iteration, and the fact that the mechanisms are LSS over the new posterior distributions. Formally this is proven using induction on the number of adaptive iterations. The base case is the coin tossing step, which is independent of the set and therefore has zero loss. For the induction step we start by denoting the projections of on and by,
where . Using this notation and that in Definition 2.7 we get that
Detailed proof can be found in Appendix B.2. ∎
Theorem 2.10 (LSS adaptively composes sub-linearly).
Under the same conditions as Theorem 2.9, and given , such that for all and any , and , where the expectation is taken over the randomness of the choice of and the internal probability of , then for any , the sequence is --LSS with respect to and any adversary over , where .
The theorem is non-trivial for .
Proof of Theorem 2.10.
The proof is based on the fact that the sum of the stability losses is a martingale with respect to , and invoking Lemma B.1, which extends Azuma’s inequality to the case of a high probability bound.
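For reference, the standard form of Azuma's inequality that this argument builds on is recalled below. The symbols $Y_i$, $c_i$ and $t$ are generic and introduced here for illustration; they are not the paper's notation, and the extension in Lemma B.1 (where the increment bound holds only with high probability) is not reproduced.

```latex
% Azuma-Hoeffding inequality (standard form).
% Assumption: Y_0, Y_1, \dots, Y_k is a martingale with |Y_i - Y_{i-1}| \le c_i almost surely.
\Pr\left[ Y_k - Y_0 \ge t \right]
  \le \exp\left( - \frac{t^2}{2 \sum_{i=1}^{k} c_i^2} \right)
  \qquad \text{for all } t > 0

% With equal increments c_i = c, a deviation of order c \sqrt{k \ln(1/\delta')}
% occurs with probability at most \delta':
\Pr\left[ Y_k - Y_0 \ge c \sqrt{2 k \ln(1/\delta')} \right] \le \delta'
```

The $\sqrt{k}$ scaling of the deviation is what makes a sub-linear composition bound possible, in contrast to the factor of $k$ obtained by summing worst-case losses.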
Formally, for any given , we can define and . (If the adversary is non-deterministic, , where is the set of all possible coin tosses of the adversary, as mentioned in Definition 2.5. If the mechanisms have some internal state not expressed by the responses, will be the domain of those states, as mentioned in Definition 2.7.) We define a probability distribution over as , and for any , define a probability distribution over given as
. We then define a sequence of random variables, and ,
Intuitively is the sum of the first losses, with a correction term which zeroes the expectation. These random variables are a martingale with respect to the random process , since
where the expectation is taken over the random process, which has randomness that results from the choice of and the internal probability of (detailed proof can be found in Appendix B.2).
Using this fact we can invoke Lemma B.1 and get that for any ,
3 LSS is Necessary and Sufficient for Generalization
Up until this point, generators and responses have been fairly abstract concepts. In order to discuss generalization and accuracy, we must make them concrete. As a result, in this section, we often consider generators in the family of functions , which we will refer to as queries and denote by , and we consider responses which have some metric defined over them. We show our results for a fairly general class of functions known as bounded linear queries. (For simplicity, throughout the following section we choose , but all results extend to any metric space, in particular .)
Definition 3.1 (Linear queries).
A function will be called a linear query, if it is defined by a function such that (for simplicity we will slightly abuse notation and denote simply as throughout the paper). If it will be called a -bounded linear query. The set of -bounded linear queries will be denoted .
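A linear query can be sketched as follows. Because the defining formula is missing from this copy, the sketch assumes the common convention that a linear query averages a per-element function over the sample set, and that boundedness means the per-element function has absolute value at most the bound; both are assumptions, and the names `linear_query` and `is_bounded` are hypothetical.

```python
def linear_query(q1):
    """Lift a per-element function q1 to a query on sample sets by averaging."""
    return lambda sample: sum(q1(x) for x in sample) / len(sample)

def is_bounded(q1, domain, bound):
    """Check |q1(x)| <= bound for every element of a finite domain."""
    return all(abs(q1(x)) <= bound for x in domain)

square = lambda x: x * x
q = linear_query(square)

print(q([1, 2, 3]))                          # average of 1, 4, 9
print(is_bounded(square, range(-2, 3), 4))   # values lie in [0, 4]
print(is_bounded(square, range(-2, 3), 3))   # 4 exceeds the bound 3
```

Under this convention, replacing one element of a sample of size n changes the query's value by at most 2 * bound / n, which is the sense in which such queries have bounded sensitivity.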
In this context, there is a “correct” answer the mechanism can produce for a given generator, defined as the correct response to the query on the sample set or distribution, and its distance from the response provided by the mechanism can be thought of as the mechanism’s error.
Definition 3.2 (Sample accuracy, distribution accuracy).
Given , , a distribution , and a query , a mechanism will be called -Sample Accurate with respect to and , if
Such a mechanism will be called -Distribution Accurate with respect to and if
where . In both cases the probability is taken over the randomness of the choice of and the internal probability of . The expectation is taken only over the randomness of the choice of . When there exists a family of distributions and a family of queries such that a mechanism is -Sample (Distribution) Accurate for all and for all , then will be called -Sample (Distribution) Accurate with respect to and .
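The difference between the two accuracy notions can be seen in a small Monte Carlo experiment. This is a minimal sketch under assumptions not taken from the text: the failure events are taken to be "the response is more than eps away from the query's value on the sample set" and "more than eps away from its expected value under the distribution", the data are fair coin flips, and the mechanism adds small Gaussian noise to the sample mean. All names and parameter values are hypothetical.

```python
import random

def noisy_mean(sample, q1, rng, sigma=0.01):
    """Answer a linear query with a little Gaussian noise added to the sample average."""
    return sum(q1(x) for x in sample) / len(sample) + rng.gauss(0.0, sigma)

def failure_rates(mechanism, q1, draw_sample, true_value, eps, trials, rng):
    """Estimate Pr[|r - q(s)| > eps] and Pr[|r - q(D)| > eps] by simulation."""
    sample_fail = dist_fail = 0
    for _ in range(trials):
        s = draw_sample(rng)
        r = mechanism(s, q1, rng)
        sample_value = sum(q1(x) for x in s) / len(s)
        sample_fail += abs(r - sample_value) > eps
        dist_fail += abs(r - true_value) > eps
    return sample_fail / trials, dist_fail / trials

rng = random.Random(1)
draw = lambda rng: [rng.random() < 0.5 for _ in range(100)]   # 100 fair coin flips
sample_rate, dist_rate = failure_rates(noisy_mean, float, draw, 0.5, 0.05, 2000, rng)
print(f"sample failure rate={sample_rate:.3f}  distribution failure rate={dist_rate:.3f}")
```

With these parameters the mechanism is almost always within eps of the sample value, yet it misses the distribution value noticeably often, because the sample mean itself fluctuates around the true mean.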
A sequence of mechanisms where which respond to a sequence of (potentially adaptively chosen) queries will be called --Sample Accurate with respect to and if
and --Distribution Accurate with respect to and if
When considering an adaptive process, accuracy is defined with respect to the adversary, and the probabilities are taken also over the choice of the coin tosses by the adaptive mechanism. (If the adaptive mechanism invokes a stateful sub-mechanism multiple times, we specify that the mechanism is sample (distribution) accurate if it is sample (distribution) accurate given any reachable internal record. Again, a somewhat more involved treatment might consider the probability that a particular internal state of the mechanism could be reached.)
We denote by the set of views consisting of responses in .
3.1 LSS Implies Generalization
As a step toward showing that LS Stability implies a high probability generalization, we first show a generalization of expectation result. We do so, as a tool, specifically for a mechanism that returns a query as its output. Intuitively, this allows us to wrap an entire adaptive process into a single mechanism. Analyzing the potential of the mechanism to generate an overfitting query is a natural way to learn about the generalization capabilities of the mechanism.
Theorem 3.3 (Generalization of expectation).
Given , a distribution , a generator , and a mechanism , if , then
The expectations are taken over the randomness of the choice of and the internal randomness of .
First notice that,
where denotes the elements of the sample set . Using this identity we separately analyze the expected value of the returned query with respect to the distribution, and with respect to the sample set (detailed proof can be found in Appendix C).
Now we can calculate the difference:
where (1) results from the definition of and the triangle inequality, and (2) from the condition that . ∎
Given , a distribution , and a generator , if a mechanism is -LSS with respect to , then
We proceed to lift this guarantee from expectation to high probability, using a thought experiment known as the Monitor Mechanism, which was introduced by [BNS16]. Intuitively, it runs a large number of independent copies of an underlying mechanism, and exposes the results of the least-distribution-accurate copy as its output.
Definition 3.5 (The Monitor Mechanism).
The Monitor Mechanism is a function which is parametrized by a sequence of mechanisms where , . Given a series of sample sets and adversary as input, it runs the adaptive mechanism between and for independent times (which in particular means neither of them share state across those iterations) and outputs a query , response and index , based on the following process:
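Monitor Mechanism
Input: a series of sample sets and an adversary
Output: a query, a response and an index
for each independent run:
    run the adaptive mechanism on that run's sample set against the adversary, and record the resulting queries and responses. (We slightly abuse notation since is not part of , but since it can be recovered from it, this term is well defined.)
select the query, response and index that are least distribution accurate
if : (The addition of this condition ensures that for the output of the mechanism, a fact that will be used later in the proof of Lemma 3.8.)
else:
return the selected query, response and index
Notice that the monitor mechanism makes use of the ability to evaluate queries according to the true underlying distribution. (Of course, no realistic mechanism would have such an ability; the monitor mechanism is simply a thought experiment used as a proof technique.)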
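The monitor thought experiment can be sketched in code when the true distribution is known, as it is in a simulation. This is a simplified illustration under stated assumptions: queries are per-element functions averaged over the sample set, the adversary is deterministic, and the sign-handling branch of the original algorithm (whose details are missing from this copy) is omitted. The names `monitor`, `exact`, `identity` and `true_value` are hypothetical.

```python
def monitor(samples, adversary, mechanisms, true_value):
    """Run one adaptive interaction per sample set; return the worst (query, response, index).

    true_value(query) evaluates a query on the underlying distribution, which only
    a thought experiment (or a simulation with a known distribution) can do.
    """
    worst = None
    for index, sample in enumerate(samples):
        view = ()
        for mechanism in mechanisms:
            query = adversary(view)
            response = mechanism(sample, query)
            view = view + (response,)
            error = abs(response - true_value(query))
            if worst is None or error > worst[0]:
                worst = (error, query, response, index)
    _, query, response, index = worst
    return query, response, index

exact = lambda sample, q1: sum(q1(x) for x in sample) / len(sample)
identity = lambda x: float(x)
adversary = lambda view: identity                      # always asks for the mean
true_value = lambda q1: 0.5 * q1(0) + 0.5 * q1(1)      # elements are fair coin flips

samples = [[0, 0, 1, 1], [1, 1, 1, 1], [0, 1, 1, 1]]
query, response, index = monitor(samples, adversary, [exact], true_value)
print(response, index)
```

Because the monitor reports the worst run out of many, a bound on its expected generalization error translates into a high-probability bound for a single run, which is the role it plays in the argument above.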
We begin by proving a few properties of the monitor mechanism. In the following claims, the probabilities and expectations are taken over the randomness of the choice of (which is assumed to be drawn iid from ) and the internal probability of .
Given , , a distribution , and an adversary , if a sequence of mechanisms where , is --LSS with respect to , then
Given , , a distribution , and an adversary , if a sequence of mechanisms where is --Sample Accurate with respect to