Bayesian Adaptive Data Analysis Guarantees from Subgaussianity

10/31/2016 ∙ by Sam Elder, et al. ∙ 0

The new field of adaptive data analysis seeks to provide algorithms and provable guarantees for models of machine learning that allow researchers to reuse their data, which normally falls outside of the usual statistical paradigm of static data analysis. In 2014, Dwork, Feldman, Hardt, Pitassi, Reingold and Roth introduced one potential model and proposed several solutions based on differential privacy. In previous work in 2016, we described a problem with this model and instead proposed a Bayesian variant, but also found that the analogous Bayesian methods cannot achieve the same statistical guarantees as in the static case. In this paper, we prove the first positive results for the Bayesian model, showing that with a Dirichlet prior, the posterior mean algorithm indeed matches the statistical guarantees of the static case. The main ingredient is a new theorem showing that the Beta(α,β) distribution is subgaussian with variance proxy O(1/(α+β+1)), a concentration result also of independent interest. We provide two proofs of this result: a probabilistic proof utilizing a simple condition for the raw moments of a positive random variable and a learning-theoretic proof based on considering the beta distribution as a posterior, both of which have implications to other related problems.


1 Introduction

The field of adaptive data analysis is motivated by the common practice in machine learning of data reuse. The potential problems with this were perhaps first illustrated by Freedman in 1983 [6], who showed that using the same data to select regressors and subsequently fit a general linear model to those regressors would lead to wildly inaccurate estimates of the goodness of fit, at least when the number of regressors and data points were on the same order of magnitude.

The simplest solution to such problems is to collect a fresh set of data each time the overall model is updated, an approach known as sample splitting. But in many scenarios, this is a rather expensive solution, especially for data practices that utilize multiple such adaptive rounds. Stated quantitatively, this requires a sample complexity that scales linearly with the number of measurements of the data.

The aim of adaptive data analysis, as introduced by Dwork, Feldman, Hardt, Pitassi, Reingold and Roth [4], is to reduce this dependence while preserving standard statistical guarantees by introducing a protective layer of interaction between the analyst and the data. To describe this challenge in a mathematical framework, they introduced a two-player game between an analyst and a curator.

In the original model, the analyst is seeking to answer some number of statistical queries about a distribution on a universe, which are expressed as the averages of bounded functions of the data. For instance, if the function takes only the values 0 and 1, this is asking the curator to estimate a probability, known as a counting query. The curator receives independent samples from the distribution and is seeking to answer every query to within a fixed additive error on the population the data is drawn from, with a prescribed probability. The central question is how many samples the curator requires to achieve this accuracy.

If the queries are specified in one batch in advance of the analyst receiving the answers, this is known as static data analysis, and samples are both necessary (on certain problems) and sufficient. In this case, the curator can simply answer with the empirical means of the functions: , where are the data points he receives. The full framework of adaptive data analysis, where later queries are allowed to depend on previous answers, is much less well-understood and the subject of this line of investigation.
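As a rough illustration of the static setting, the sketch below answers a fixed batch of queries with empirical means. The helper `hoeffding_sample_size` is not taken from the text: it is the textbook Hoeffding-plus-union-bound calculation, and the names `num_queries`, `eps` and `delta` are assumptions chosen here for the number of queries, the additive error and the failure probability.

```python
import math


def empirical_answers(samples, queries):
    """Answer each query with the empirical mean of its function over the samples."""
    n = len(samples)
    return [sum(q(x) for x in samples) / n for q in queries]


def hoeffding_sample_size(num_queries, eps, delta):
    """Samples sufficient for every query in a fixed batch to be eps-accurate
    with probability at least 1 - delta (Hoeffding's inequality + union bound).
    Assumes each query function takes values in [0, 1]."""
    return math.ceil(math.log(2 * num_queries / delta) / (2 * eps ** 2))


if __name__ == "__main__":
    data = [0, 1, 1, 0, 1, 1]
    batch = [lambda x: x, lambda x: 1 - x]  # two counting queries
    print(empirical_answers(data, batch))
    print(hoeffding_sample_size(num_queries=100, eps=0.1, delta=0.05))
```

The sample size grows only logarithmically in the number of queries, which is the behaviour that sample splitting loses once queries become adaptive.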

As we previously argued in [5], though, the hardest problems introduced in this framework tend not to reflect the reality it is attempting to model. In fact, since the analyst is taken to be adversarial and worst-case, we could suppose that the analyst already knows the distribution ! To avoid such scenarios and hopefully better match reality, we introduced a Bayesian variant of the problem, where we enforce that the curator and analyst have the same information via a common accurate Bayesian prior on .

The first question, which we addressed in [5], is what sorts of adaptive analyst strategies could cause problems for the curator without utilizing side information. The main result in that work was a problem on which a large class of curator algorithms would fail. This problem had two components: A difficult problem based on error-correcting codes that would produce high posterior uncertainty in some direction and an augmentation technique to allow the analyst to extract information from these curator algorithms using slightly correlated queries.

Quantitatively, as previously discussed, sample splitting methods can trivially achieve a linear dependence: . The original dimension-free work of DFHPRR improved this to , but this is far short of the static bound cited above, . In [5], we ultimately demonstrated a high-dimensional problem on which a large class of curator algorithms require about .

In this paper, we demonstrate the first positive results in Bayesian adaptive data analysis. In particular, we show that if the common prior is a Dirichlet distribution, the most natural Bayesian curator algorithm, the posterior mean, achieves the same level of accuracy against adaptive queries as static queries: .

This positive result is also rather different than previous positive claims. All previously proposed techniques in this area, from the differential privacy-based techniques introduced in DFHPRR [4], to follow-up work restricting to the special case of a machine learning competition leaderboard by Blum and Hardt [1], to the wide class of techniques analyzed in [5], focus on attempting to obfuscate the answers in order to prevent the analyst from overfitting to the data. By contrast, we show in this paper that under a Dirichlet prior, the curator can achieve the static guarantee even without any obfuscation!

To prove such a result, the curator must be confident in answers to any possible query. Our main tool for proving such confidence will be the probabilistic notion of subgaussianity. In section 2, we recall the definition and key properties of this strong notion of concentration. We then prove the main technical result in section 3 in two ways, showing that the $\mathrm{Beta}(\alpha,\beta)$ distribution is subgaussian with variance proxy $O(1/(\alpha+\beta+1))$. The first method consists of bounding the moment generating function in terms of the raw moments of the distribution, which are easy to compute. The second method is simpler and stronger, but more mysteriously requires one to consider the beta distribution as a Bayesian posterior. In section 4, we show how this implies that the curator is correct on all queries given a Dirichlet prior.

We then discuss extensions of these main results in section 5. We first generalize the probabilistic approach on the beta distribution to a simple condition for (upper) subgaussianity on the raw moments of a nonnegative random variable. We also discuss several other conjugate priors, where empirical evidence suggests that the corresponding results are true. Then we consider the second learning-theoretic approach, and show that a series of simple conditions on the posterior mean evolution suffice, which helps to explain what makes the Dirichlet prior this friendly to adaptivity. Finally, in section 6, we survey the implications for adaptive data analysis and offer directions for future work.

2 Subgaussianity and Query Accuracy

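A random variable $X$ with zero mean is said to be $\sigma^2$-subgaussian if for all $t \in \mathbb{R}$, $\mathbb{E}[e^{tX}] \le e^{\sigma^2 t^2/2}$. Here, $\sigma^2$ is also known as the variance proxy. Following the seminal work on subgaussian random variables [3], define the subgaussian norm of $X$ to be the smallest $\sigma \ge 0$ for which $X$ is $\sigma^2$-subgaussian. Recall the following basic facts about subgaussian random variables.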

  1. As the name suggests, the variance gives a lower bound on the subgaussian variance proxy: if $X$ is $\sigma^2$-subgaussian, then $\mathrm{Var}(X) \le \sigma^2$. If the two are equal, $X$ is said to be strictly subgaussian.

  2. (See [3] Theorem 1.2) The space of subgaussian random variables is a Banach space with respect to the norm . That is, it has the right scaling, for all , satisfies the triangle inequality and is complete on the space of subgaussian random variables.

  3. (See [3] Theorem 1.3) is -subgaussian if for all integers ,

    If is symmetric (i.e. and

    have the same distributions, so all odd moments are zero), the factor of

    can be dropped.

If , we will abuse notation slightly and write . That is, we will consider random variables with nonzero mean to be -subgaussian if their centered versions are. Note that in this context, conditions like in Proposition 2.3 apply to the centered moments .

Now, let us return to the Bayesian adaptive data analysis problem and consider a single query for a moment, setting . In this case, the static bound states that samples are sufficient for estimation of to additive error , with probability . We will show that for this particular relationship between and , this follows from a subgaussianity property on the posterior.

If the curator’s posterior distribution is -subgaussian with respect to every query, then the posterior mean-answering curator answers correctly and achieves the static sample complexity of . Before proving the proposition, let us explicate this condition. Each query projects every possible population to a value . Therefore, projects every possible posterior distribution on populations to a distribution on . The assumption is that this projected distribution of the curator’s posterior is -subgaussian.

The claim is that as long as this condition holds, the curator doesn’t care what queries the analyst asks; the posterior mean will be accurate on all of them. Indeed, the curator could actually release the entire posterior mean (not just its value on every query), giving the analyst everything she will ever learn from his answers. If we can show that the probability of error on any query is less than , then a union bound will give a total error probability at most , no matter which queries are asked.

Proof.

Suppose the curator’s posterior distribution with respect to a given query is -subgaussian. Equivalently stated, suppose the error of the posterior mean is a (centered) -subgaussian random variable. By Markov’s inequality, setting ,

We can also prove an identical bound for , and therefore, the error probability satisfies

Taking , we have shown the required bound.∎
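The displayed inequalities of this proof did not survive extraction, so the following is only a sketch of the standard Chernoff argument it appears to follow: for a centered $\sigma^2$-subgaussian variable, Markov's inequality applied to $e^{tX}$ with $t = \varepsilon/\sigma^2$ gives a one-sided tail of $e^{-\varepsilon^2/(2\sigma^2)}$, and the two-sided bound doubles it. The function names and the symbols `eps` and `delta` are assumptions made here for illustration.

```python
import math


def subgaussian_tail_bound(sigma2, eps):
    """Two-sided Chernoff bound P(|X| >= eps) <= 2 exp(-eps^2 / (2 sigma2))
    for a centered sigma2-subgaussian random variable X."""
    return min(1.0, 2.0 * math.exp(-eps ** 2 / (2.0 * sigma2)))


def variance_proxy_needed(eps, delta):
    """Largest variance proxy for which the bound above is at most delta."""
    return eps ** 2 / (2.0 * math.log(2.0 / delta))


if __name__ == "__main__":
    eps, delta = 0.1, 0.01
    sigma2 = variance_proxy_needed(eps, delta)
    print(sigma2, subgaussian_tail_bound(sigma2, eps))
```

Read this way, a posterior whose variance proxy shrinks like one over the number of samples yields an error probability that decays exponentially in the sample size for a fixed additive error.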

This is rather remarkable; it says that the curator can do just as well against adaptive queries as against static queries if his posterior has this concentration property. Of course, such a concentration property will not hold in every case; for instance, it is far from true for the posteriors in the series of examples considered in [5]. But in cases where it does hold, like the Dirichlet prior and posterior we will investigate shortly, the curator doesn’t have to do any obfuscation.

As an added bonus, the subgaussian framework also simplifies the set of queries we must consider: If the curator’s posterior distribution is -subgaussian with respect to every counting query, then the posterior mean curator answers correctly and achieves the static sample complexity of .

Proof.

The key here is Proposition 2.2. Since is a norm, convex combinations of -subgaussian random variables will also be -subgaussian:

Recall that counting queries only have values 0 or 1, and therefore form the vertices in the hypercube of possible queries . All other queries are convex combinations of these, and therefore will be -subgaussian as well (with no loss in the constant). (Footnote 1: If the universe is infinite, then some queries might not be convex combinations of finitely many counting queries, but we can find a sequence of finite convex combinations converging to them in -norm simply by picking more possible values for the function in at each step. This will converge in -norm, so the completeness of from Proposition 2.2 shows that they must also have the same bound on their subgaussian norm.) Therefore, by Proposition 2, the posterior mean curator wins on all statistical queries if he wins on counting queries.∎
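On a finite universe, the convex-combination claim can be made concrete with a "layer cake" decomposition: a query with values in [0, 1] is a weighted average of the threshold counting queries that indicate where it reaches each level. The sketch below is one such construction, not necessarily the one the authors had in mind; `layer_cake` is a hypothetical helper and exact `Fraction` arithmetic is used so the recombination can be checked exactly.

```python
from fractions import Fraction


def layer_cake(query):
    """Write a query (sequence of values in [0, 1]) as a convex combination
    of counting queries. Returns a list of (weight, counting_query) pairs."""
    levels = sorted({v for v in query if v > 0})
    parts = []
    previous = Fraction(0)
    for level in levels:
        # Counting query indicating where the original query reaches this level.
        indicator = tuple(1 if v >= level else 0 for v in query)
        parts.append((level - previous, indicator))
        previous = level
    if previous < 1:
        # Leftover weight goes on the all-zero counting query.
        parts.append((1 - previous, tuple(0 for _ in query)))
    return parts


if __name__ == "__main__":
    q = (Fraction(1, 4), Fraction(3, 4), Fraction(0), Fraction(1))
    for weight, counting_query in layer_cake(q):
        print(weight, counting_query)
```

The weights are positive and sum to one, so any bound on the subgaussian norm that holds for all counting queries carries over to the original query through the triangle inequality.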

3 Beta Priors on Bernoulli Trials

The first posterior distribution we will investigate is the ubiquitous Beta distribution. We will see in the next section that this distribution is also the projection of the Dirichlet distribution on any counting query.

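The beta distribution $\mathrm{Beta}(\alpha,\beta)$ is a continuous distribution on $[0,1]$ with density

$$f(x;\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1},$$

where $\Gamma$ is the gamma function, satisfying $\Gamma(n) = (n-1)!$ for positive integers $n$. The fraction in the density formula above is simply a normalization constant. Notice that for $\alpha=\beta=1$, this is the uniform distribution on $[0,1]$. For $\alpha=\beta\to 0$, this converges to the Rademacher 0-1 random variable, $\mathrm{Bernoulli}(1/2)$. Finally, the (raw) moments of the beta distribution are given by

$$\mathbb{E}[X^k] = \frac{\alpha^{(k)}}{(\alpha+\beta)^{(k)}}, \tag{1}$$

where $x^{(k)} = x(x+1)\cdots(x+k-1)$ is a rising factorial. In particular, the mean and variance are given by

$$\mathbb{E}[X] = \frac{\alpha}{\alpha+\beta}, \qquad \mathrm{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$$

To see the beta distribution as a posterior, consider a Bernoulli trial, a single event with two possible outcomes: success or failure. If our prior over the probability of success $p$ is given by $\mathrm{Beta}(\alpha,\beta)$, then upon receiving $a$ successes and $b$ failures, a Bayesian update yields a posterior proportional to

$$p^{a}(1-p)^{b}\cdot p^{\alpha-1}(1-p)^{\beta-1} = p^{\alpha+a-1}(1-p)^{\beta+b-1},$$

so after renormalizing, it must be the $\mathrm{Beta}(\alpha+a,\beta+b)$ distribution. This is what is meant by calling the family of beta distributions a conjugate prior: If the prior is a beta distribution, the posterior will be as well.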
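The raw-moment formula and the conjugate update are easy to check by hand on small cases. The following sketch computes beta moments from rising factorials and performs the posterior update in exact arithmetic; all function names are hypothetical helpers introduced for illustration.

```python
from fractions import Fraction


def rising_factorial(x, k):
    """x (x + 1) ... (x + k - 1), with the empty product equal to 1."""
    result = Fraction(1)
    for i in range(k):
        result *= x + i
    return result


def beta_raw_moment(alpha, beta, k):
    """k-th raw moment of Beta(alpha, beta) as a ratio of rising factorials."""
    return rising_factorial(alpha, k) / rising_factorial(alpha + beta, k)


def beta_mean_and_variance(alpha, beta):
    m1 = beta_raw_moment(alpha, beta, 1)
    m2 = beta_raw_moment(alpha, beta, 2)
    return m1, m2 - m1 ** 2


def beta_posterior(alpha, beta, successes, failures):
    """Conjugate update: add observed successes and failures to the parameters."""
    return alpha + successes, beta + failures


if __name__ == "__main__":
    print(beta_mean_and_variance(2, 3))
    a, b = beta_posterior(1, 1, successes=7, failures=3)
    print(a, b, beta_raw_moment(a, b, 1))  # posterior parameters and posterior mean
```

Note that every observation increases the sum of the two parameters by one, which is why posteriors computed from more data are more concentrated.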

Surprisingly, despite substantial focus including the entire Handbook of Beta Distribution and its Applications [7], the following concentration result does not appear to be known: The beta distribution is -subgaussian. Numerical data suggests that

That is, the variance proxy appears to be maximized for fixed $\alpha+\beta$ when $\alpha=\beta$, where the distribution appears to be strictly subgaussian. Since the variance is a lower bound for the variance proxy (Proposition 2.1), we conclude that Theorem 3 is tight up to a factor of , seen as a function of . However, numerical data is very clear that the lower bound is tight: The $\mathrm{Beta}(\alpha,\beta)$ distribution is $\frac{1}{4(\alpha+\beta+1)}$-subgaussian.

Before proving Theorem 3, let's see what this concentration result means for a nearly trivial case of Bayesian adaptive data analysis. If the prior on the parameter of a Bernoulli distribution is given by $\mathrm{Beta}(\alpha,\beta)$ for any $\alpha,\beta>0$, the posterior mean curator strategy wins.

Proof.

If the prior is , the posterior is with .

With a Bernoulli trial, there are two nontrivial counting queries, the probabilities of success and failure, or and . The latter is an affine function of , so by Propositions 2.2 and 2 it suffices to show that is -subgaussian.

Indeed, Theorem 3 shows that the posterior is -subgaussian, so by Proposition 2, the posterior mean curator strategy wins.∎

Thus, we see that as a conjugate family, the beta distributions have an interesting property: Only some beta distributions can be posteriors after observing and updating on data points. All such posteriors have , and Theorem 3 says that they must concentrate to the degree we require.
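The numerical observations about the variance proxy mentioned in this section can be explored with a small experiment. The sketch below is an assumption-laden illustration rather than the authors' procedure: it sums the moment generating function from the raw moments, uses the reflection $1-X \sim \mathrm{Beta}(\beta,\alpha)$ for negative arguments to avoid cancellation, and estimates the smallest variance proxy as the maximum of $2\log \mathbb{E}[e^{t(X-\mu)}]/t^2$ over a finite grid of $t$ (so it can only underestimate the true supremum).

```python
import math


def beta_mgf(alpha, beta, t, max_terms=2000):
    """E[exp(t X)] for X ~ Beta(alpha, beta), summed from the raw moments."""
    if t < 0:
        # 1 - X ~ Beta(beta, alpha), so E[e^{tX}] = e^t * E[e^{(-t)(1 - X)}].
        return math.exp(t) * beta_mgf(beta, alpha, -t, max_terms)
    total, term = 1.0, 1.0
    for k in range(max_terms):
        # Ratio of consecutive series terms: (alpha + k)/(alpha + beta + k) * t/(k + 1).
        term *= (alpha + k) / (alpha + beta + k) * t / (k + 1)
        total += term
        if term < 1e-17 * total:
            break
    return total


def variance_proxy_estimate(alpha, beta, t_max=20.0, steps=400):
    """Grid estimate of sup_t 2 log E[exp(t (X - mean))] / t^2."""
    mean = alpha / (alpha + beta)
    best = 0.0
    for i in range(1, steps + 1):
        for t in (i * t_max / steps, -i * t_max / steps):
            log_mgf_centered = math.log(beta_mgf(alpha, beta, t)) - mean * t
            best = max(best, 2.0 * log_mgf_centered / t ** 2)
    return best


if __name__ == "__main__":
    for a, b in [(2.0, 2.0), (1.0, 3.0), (0.5, 3.5)]:
        variance = a * b / ((a + b) ** 2 * (a + b + 1))
        print(a, b, variance, variance_proxy_estimate(a, b))
```

For symmetric parameters the estimate sits at the variance, while for skewed parameters with the same sum it lands between the variance and the symmetric value, in line with the behaviour described above.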

3.1 Probabilistic Proof

We will provide two very different proofs of results like Theorem 3. We’ll start with the probabilistic result, which actually can only produce a weaker constant: The beta distribution is -subgaussian.

Proof.

Notice that technically, this is a claim about the centered beta distribution, or . However, we do not prove this claim from the centered moments, but from the raw moments themselves. That is, rather than proving

by expanding termwise in , we move the mean term to the other side and show

(2)

by expanding termwise in . The coefficients of the left side are the raw moments given in (1). To bound the terms in this expansion, we will need the following technical lemma: For any nonnegative integer and ,

Before proving this lemma, let us see how it implies Theorem 3. First, we consider the even terms:

In going from the first to the second line, we have replaced the pairs of consecutive terms by applying Lemma 3.1. From the second to the third, we have expanded the product and grouped terms, upper bounding the sum of all products of of the right fractions by times the largest of them. From the third to the fourth lines, we have expanded and doubled the terms in the numerator to interleave with the odd terms, adding an extra factor of 2 to the fraction raised to the power.

The odd terms are similar, except we pair the terms up in the moment starting from the second.

The only other major difference is that between the third and fourth lines, we also replaced the numerators of with the corresponding larger values of . Therefore, for , we have shown that all of the terms of the left side of (2) are bounded by the corresponding terms of the right side.

To prove (2) for , we utilize the symmetry of the beta distribution: . Therefore, we’ve in fact also shown that for ,

Dividing both sides by , we immediately get

so the desired bound holds for as well. We are done, apart from proving the technical lemma.∎

Proof of Lemma 3.1.

We induct on , starting with two base cases: and . For , we have

where we used the AM-GM inequality . Now, if , we have

as desired. For the inductive step, take . Then

where we applied the inductive hypothesis to get the last line.∎

Unfortunately, it doesn’t appear possible to squeeze an additional factor of 2 out of this method. In fact, (2) is not true termwise if we replace the denominator with . Therefore any proof of Conjecture 3 will have to use a different method. (Footnote 2: For a concrete example, when and , the terms have coefficients of , respectively, the opposite direction as needed to prove (2). However, the coefficients of and correct for this term, and is still -subgaussian.)

3.2 Learning-Theoretic Proof

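Quite surprisingly, we can get a simpler and stronger result by considering the beta distribution as a Bayesian posterior. The key is Azuma’s Inequality (c.f. [9] Lemma 3.7): Let $X_0, X_1, \dots, X_n$ be a martingale adapted to the filtration $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \cdots \subseteq \mathcal{F}_n$ such that for all $k$, $X_k - X_{k-1}$ conditioned on $\mathcal{F}_{k-1}$ is $\sigma_k^2$-subgaussian. Then $X_n - X_0$ is $\left(\sum_{k=1}^{n} \sigma_k^2\right)$-subgaussian.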

Proof of Theorem 3.

The martingale we will construct is surprisingly related to our problem: Let be the prior over the parameter of a Bernoulli random variable, let be the updated information upon receiving samples from the random variable, and let be the resulting posterior mean. Then as , approaches the true parameter of the random variable almost surely, so approaches the error of the original posterior mean.

Moreover, it is fairly elementary to check that is a martingale with the appropriate subgaussian variance proxy. Suppose that the posterior after the first data points is , where of course . Then the posterior mean . Conditioning on the samples the curator has seen so far, the next sample will be a success with probability and a failure with probability . That is:

This difference is clearly mean-zero, so is indeed a martingale. As a centered binary random variable, by Theorem 3.1 in [2], the subgaussian variance proxy of is given by

In particular (Lemma 2.1 in [2]), for all , so is -subgaussian.

Therefore, by Azuma’s inequality, is subgaussian with variance proxy

as desired. (Footnote 3: The final steps of this analysis are also tight.)
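The martingale property of the posterior mean can be verified exactly for any particular beta posterior. The sketch below does this with rational arithmetic, and also sums per-step variance proxies under an assumption made here for illustration only: that each centered Bernoulli increment is bounded by Hoeffding's worst-case proxy of 1/4 times its squared scale. The source's exact per-step bound and final constant were lost in extraction, so the resulting sum should be read as an order-of-magnitude check, not as the theorem's constant.

```python
from fractions import Fraction


def posterior_mean(alpha, beta):
    return Fraction(alpha) / (Fraction(alpha) + Fraction(beta))


def mean_increments(alpha, beta):
    """(probability, change in posterior mean) for the two outcomes of the
    next sample, which is a success with probability equal to the current mean."""
    p = posterior_mean(alpha, beta)
    up = posterior_mean(alpha + 1, beta) - p    # after observing a success
    down = posterior_mean(alpha, beta + 1) - p  # after observing a failure
    return [(p, up), (1 - p, down)]


def expected_increment(alpha, beta):
    return sum(prob * change for prob, change in mean_increments(alpha, beta))


def increment_proxy_sum(alpha, beta, steps):
    """Sum of per-step proxies, ASSUMING the 1/4 Hoeffding bound for a centered
    Bernoulli; step k scales the Bernoulli by 1 / (alpha + beta + k + 1)."""
    n = alpha + beta
    return sum(0.25 / (n + k + 1) ** 2 for k in range(steps))


if __name__ == "__main__":
    print(expected_increment(2, 5))           # exactly zero: a martingale step
    print(mean_increments(2, 5))
    print(increment_proxy_sum(2, 5, 100000))  # compare with 1 / (4 * 7)
```

The increments shrink like one over the current parameter sum, so their squared scales are summable and the total is of order $1/(\alpha+\beta)$.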

Impressively, this method is able to prove a strictly stronger result than the probabilistic approach, matching the correct coefficient on for the symmetric case.

4 Dirichlet Priors on Categorical Random Variables

Of course, the example considered in Corollary 3 is a nearly trivial example of Bayesian adaptive data analysis: There is (essentially) only one possible query, making adaptivity meaningless. However, the result fortunately generalizes to a much more useful framework: Dirichlet priors on categorical random variables.

Recall that a categorical random variable has support $\{1,\dots,m\}$ for some positive integer $m$, and probabilities $p_i$ for each value $i$. The case $m=2$ corresponds to a Bernoulli trial again, but if $m>2$, there are many possible queries, each corresponding to a vector $q \in [0,1]^m$ and asking for the dot product $q \cdot p$.

The conjugate prior for the categorical random variable is the Dirichlet distribution $\mathrm{Dir}(\alpha_1,\dots,\alpha_m)$, the natural generalization of the beta distribution. Its probability density function is given by

$$f(p_1,\dots,p_m) = \frac{\Gamma\left(\sum_{i=1}^m \alpha_i\right)}{\prod_{i=1}^m \Gamma(\alpha_i)} \prod_{i=1}^m p_i^{\alpha_i-1}.$$

Therefore, upon receiving data with counts $c_i$ of category $i$, the posterior is given by $\mathrm{Dir}(\alpha_1+c_1,\dots,\alpha_m+c_m)$, making the Dirichlet distribution a conjugate prior for the categorical distribution.
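A posterior mean curator for this model is short to write down. The sketch below performs the Dirichlet update and answers a query vector with its posterior mean; the function names are hypothetical, and exact fractions are used only to keep the example checkable by hand.

```python
from fractions import Fraction


def dirichlet_posterior(prior, counts):
    """Conjugate update: add the observed count of each category."""
    return [a + c for a, c in zip(prior, counts)]


def posterior_mean_answer(params, query):
    """Posterior mean of the dot product of the query with the category
    probabilities, under a Dirichlet distribution with the given parameters."""
    total = sum(params)
    return sum(Fraction(q) * Fraction(a) for q, a in zip(query, params)) / total


if __name__ == "__main__":
    posterior = dirichlet_posterior([1, 1, 1], [3, 0, 5])
    print(posterior)
    print(posterior_mean_answer(posterior, [1, 0, 1]))  # a counting query
```

Because the answer is a deterministic function of the posterior, the curator adds no noise; any protection against adaptivity has to come from the concentration of the posterior itself.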

In the direction of any query vector , the Dirichlet distribution is -subgaussian. In exactly the same way, this guarantees accuracy of the posterior mean: If the prior on the parameter of a categorical random variable is for any , the posterior mean curator strategy wins.

Proof of Theorem 4.

By Proposition 2, it suffices to check this for counting queries, that is, query vectors with all entries equal to 0 or 1. We will show that the distribution of the Dirichlet prior with respect to such queries is merely a beta distribution, and apply Theorem 3. (Footnote 4: This fact might be well-known, but it also isn’t hard to prove. It’s at least well-known that the marginals in each category are beta distributions, but those only cover the case where only one entry of the query vector is 1.)

By relabeling coordinates, we can suppose that and for some (if all or all the dot product is always or respectively). We first transform the simplex of possible in a fairly common way by considering the partial sums . Then the simplex is given by

In this notation, corresponds to . The probability density of at is therefore proportional to the -dimensional volume with respect to the Dirichlet distribution of this slice, or

where we have substituted for and for . Pulling out the factors of and , the remainder no longer depends on . After normalizing, this is exactly the distribution. Therefore, by Theorem 3, this distribution is -subgaussian, as desired.∎
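The conclusion of this proof, that a counting query applied to a Dirichlet-distributed parameter follows a beta distribution whose two parameters are the sums of the Dirichlet parameters inside and outside the query, can be sanity-checked by simulation. The sketch below is a Monte Carlo check under stated assumptions: it samples the Dirichlet by normalizing independent gamma variables and compares only the first two moments, and all names are hypothetical.

```python
import random


def sample_dirichlet(alphas, rng):
    """Sample from Dir(alphas) by normalizing independent Gamma(alpha_i, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


def counting_query_moments(alphas, subset, num_samples=20000, seed=0):
    """Empirical mean and variance of the total probability of `subset`."""
    rng = random.Random(seed)
    values = []
    for _ in range(num_samples):
        p = sample_dirichlet(alphas, rng)
        values.append(sum(p[i] for i in subset))
    mean = sum(values) / num_samples
    var = sum((v - mean) ** 2 for v in values) / num_samples
    return mean, var


def beta_moments(a, b):
    """Exact mean and variance of Beta(a, b)."""
    return a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1))


if __name__ == "__main__":
    alphas = [0.5, 1.5, 2.0, 3.0]
    subset = [0, 2]
    inside = sum(alphas[i] for i in subset)
    outside = sum(alphas) - inside
    print(counting_query_moments(alphas, subset))
    print(beta_moments(inside, outside))
```

Matching two moments does not prove equality of distributions, so this complements the change-of-variables argument above rather than replacing it.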

5 Discussion

These results can be extended in multiple directions. First, we examine the core technique, and then we examine other potential conjugate priors.

5.1 Subgaussianity from Raw Moments

The proof technique used in Theorem 3 is perhaps the most generally useful contribution of this paper. Most proofs that a random variable is subgaussian involve showing a bound on its centered moments like in Proposition 2.3. However, this result only used the raw moments. In fact, this is all we need: If is a random variable with positive raw moments satisfying

(3)

for every nonnegative integer , then for all , . Note that for , this condition says that .

Proof.

We simply mimic the proof of Theorem 3. Since all moments are positive, we can apply the assumed inequality to each term of a telescoping product:

The odd terms are also the same:

Summing all of the terms, which are positive since , therefore yields the desired inequality.∎

For ease of discussion, this definition will be helpful: A random variable is -upper subgaussian if for all ,