Technological advances are continuously driving down the cost of data collection and storage. Data collection devices such as smartphones and wearable health monitors have become ubiquitous, resulting in continuous accumulation of new data. This means that the statistical properties of a database may evolve dramatically over time, and earlier analysis of a database may grow stale. For example, tasks like identifying trending news topics critically rely on dynamic data and dynamic analysis. To harness the value of growing databases and keep up with data analysis needs, guarantees of machine learning algorithms and other statistical tools must apply not just to fixed databases but also to dynamic databases.
Learning algorithms must deal with highly personal data in contexts such as wearable health data, browsing behavior, and GPS location data. In these settings, privacy concerns are particularly important. Analysis of sensitive data without formal privacy guarantees has led to numerous privacy violations in practice, for tasks such as recommender systems [CKN11], targeted advertising [Kor10], data anonymization [NS08], and deep learning [HAPC17].
In the last decade, a growing literature on differential privacy has developed to address these concerns (see, e.g., [DR14]). First defined by [DMNS06], differential privacy gives a mathematically rigorous worst-case bound on the maximum amount of information that can be learned about any one individual’s data from the output of an algorithm. The theoretical computer science community has been prolific in designing differentially private algorithms that provide accuracy guarantees for a wide variety of machine learning problems (see [JLE14] for a survey). Differentially private algorithms have also begun to be implemented in practice by major organizations such as Apple, Google, Uber, and the United States Census Bureau. However, the vast majority of differentially private algorithms are designed only for static databases, and are ill-equipped to handle new environments with growing data.
This paper presents a collection of tools for machine learning and other types of data analysis that guarantee differential privacy and accuracy as the underlying databases grow arbitrarily large. We give both a general technique and a specific algorithm for adaptive analysis of dynamically growing databases. Our general technique is illustrated by two algorithms that schedule black box access to some algorithm that operates on a fixed database, to generically transform private and accurate algorithms for static databases into private and accurate algorithms for dynamically growing databases. Our specific algorithm directly adapts the private multiplicative weights algorithm of [HR10] to the dynamic setting, maintaining the accuracy guarantee of the static setting through unbounded data growth.
1.1 Our Results
Here we outline our two sets of results for adaptive analysis of dynamically growing databases. Throughout the paper, we refer to the setting in which a database is fixed for the life of the analysis as the static setting, and we refer to the setting in which a database is accumulating new data entries while the analysis is ongoing as the dynamic setting.
Our first set of results consists of two methods for generically transforming a black box algorithm that is private and accurate in the static setting into an algorithm that is private and accurate in the dynamic setting. BBScheduler reruns the black box algorithm every time the database increases in size by a small multiplicative factor, and it provides privacy and accuracy guarantees that are independent of the total number of queries (Theorem 10). BBScheduler calls each successive run of the black box algorithm with an exponentially shrinking privacy parameter to achieve any desired total privacy loss. The time-independent accuracy guarantee arises from the calibration of the decreasing per-run privacy parameters with the increasing database size. We instantiate this scheduler using the SmallDB algorithm [BLR08] for answering linear queries on a database of size over a universe of size . With our scheduler we can answer an unbounded number of queries from a linear query class of size on a growing database with starting size over a universe of size . The static and dynamic settings have the following respective accuracy guarantees (Theorem 7, [BLR08]; Theorem 9):
Our second transformation, BBImprover, runs the black box every time new entries are added to the database, and it yields accuracy guarantees that improve as more data accumulate. This algorithm is well-suited for problems where data points are sampled from a distribution, where one would expect the accuracy guarantees of static analysis to improve with the size of the sample. We apply this scheduler to private empirical risk minimization (ERM) algorithms to output classifiers with generalization error that improves as the training database grows (Theorem 12).
The following informal theorem statement summarizes our results for BBScheduler (Theorem 10) and BBImprover (Theorem 13). These results show that almost any private and accurate algorithm can be rerun at appropriate points of data growth with minimal loss of accuracy. Throughout the paper, we use to denote the starting size of the database. The statement below hides terms, and we suppress dependence on parameters other than and (e.g., data universe size , number of queries , failure probability). Section 3 also provides an improved -private version of BBScheduler.
Theorem 1 (Informal).
Let be an -differentially private algorithm that is -accurate for an input query stream for some and constant . Then
BBScheduler running is -differentially private and -accurate for .
BBImprover running is -differentially private and is -accurate for , where bounds the error when the database is size .
Our second set of results opens the black box to increase accuracy and adaptivity by modifying the private multiplicative weights (PMW) algorithm [HR10], a broadly useful algorithm for privately answering an adaptive stream of linear queries with accuracy . Our modification for growing databases (PMWG) considers all available data when any query arrives, and it suffers asymptotically no additional accuracy cost relative to the static setting.
The static PMW algorithm answers an adaptive stream of queries while maintaining a public histogram reflecting the current estimate of the database given all previously answered queries. It categorizes incoming queries as either easy or hard, suffering significant privacy loss only for the hard queries. Hardness is determined with respect to the public histogram: upon receipt of a query for which the histogram provides a significantly different answer than the true database, PMW classifies this as a hard query, and it updates the histogram in a way that moves it closer to a correct answer on that query. The number of hard queries is bounded using a potential argument. Potential is defined as the relative entropy between the database and the public histogram. This quantity is initially bounded, it decreases by a substantial amount after every hard query, and it never increases.
The main challenge in adapting PMW to the dynamic setting is that we can no longer use this potential argument to bound the number of hard queries. This is because the relative entropy between the database and the public histogram can increase as new data arrive. In the worst case, PMW can learn the database with high accuracy (using many hard queries), and then adversarial data growth can change the composition of the database dramatically, allowing for even more hard queries on the new data than is possible in the static setting. Instead, we modify PMW so that the public histogram updates not only in response to hard queries but also in response to new data arrivals. By treating the new data as coming from a uniform distribution, these latter updates incur no additional privacy loss, and they mitigate the relative entropy increase due to new data. This modification allows us to maintain the accuracy guarantee of the static setting through unbounded data growth. The following informal theorem is a statement of our main result for PMWG (Theorem 14). As with the static PMW algorithm, we can improve the exponent in the bound to if our goal is -privacy for (Theorem 17).
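The effect of a data-arrival update on the potential can be illustrated numerically. The sketch below is not the PMWG algorithm itself: it assumes that "treating the new data as coming from a uniform distribution" means blending the public histogram with the uniform distribution, giving the new entry weight 1/t, and all function names are hypothetical. Under that assumption, joint convexity of relative entropy bounds the new potential by (t-1)/t times the old potential plus log(N)/t, whatever the type of the arriving entry.

```python
import math

def relative_entropy(x, y):
    """RE(x || y) = sum_i x_i * log(x_i / y_i), with the convention 0 * log 0 = 0."""
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y) if xi > 0)

def append_entry(x, t, new_type):
    """Histogram after one entry of type `new_type` joins a size-(t-1) database x."""
    return [((t - 1) * xi + (1.0 if i == new_type else 0.0)) / t
            for i, xi in enumerate(x)]

def mix_with_uniform(y, t):
    """Assumed arrival update: blend the public histogram with the uniform
    distribution, giving the new entry weight 1/t. Uses no private data."""
    N = len(y)
    return [((t - 1) * yi + 1.0 / N) / t for yi in y]

# A histogram that has learned a size-10 database well, then an
# adversarial arrival of a type the histogram considers very unlikely.
N, t = 4, 11
x_old = [0.5, 0.5, 0.0, 0.0]
y_old = [0.4995, 0.4995, 0.0005, 0.0005]
x_new = append_entry(x_old, t, new_type=3)
y_mixed = mix_with_uniform(y_old, t)

re_before = relative_entropy(x_old, y_old)   # potential before the arrival
re_stale = relative_entropy(x_new, y_old)    # potential if the histogram is left alone
re_mixed = relative_entropy(x_new, y_mixed)  # potential after the assumed update
convexity_bound = (t - 1) / t * re_before + math.log(N) / t
```

In this particular example the mixed histogram also ends up with a smaller potential than the stale one; that comparison is specific to the example, whereas the convexity bound holds in general.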
Theorem 2 (Informal).
PMWG is -differentially private and -accurate for any stream of up to queries when the database is any size for some .
Along the way, we develop extensions of several static differentially private algorithms to the dynamic setting. These algorithms are presented in Appendix C, and may be of independent interest for future work on the design of differentially private algorithms for growing databases.
1.2 Related Work
Online Learning. Our setting of dynamically growing databases is most closely related to online learning, where a learner plays a game with an adversary over many rounds. On each round , the adversary first gives the learner some input, then the learner chooses an action and receives a loss function chosen by the adversary, and experiences loss . There is a vast literature on online learning, including several works on differentially private online learning [JKT12, ST13, AS17]. In those settings, a database is a sequence of loss functions, and neighboring databases differ on a single loss function. While online learning resembles the dynamic database setting, there are several key differences. Performance bounds in the online setting are in terms of regret, which is a cumulative error term. On the other hand, we seek additive error bounds that hold for all of our answers. Such bounds are not possible in general for online learning, since the inputs are adversarial and the true answer is not known. In our case, we can achieve such bounds because even though queries are presented adversarially, we have access to the query’s true answer. Instead of a cumulative error bound, we manage a cumulative privacy budget.
Private Learning on a Static Database. There is a prominent body of work designing differentially private algorithms in the static setting for a wide variety of machine learning problems (see [JLE14] for a survey). These private and accurate algorithms can be used as black boxes in our schedulers BBScheduler and BBImprover. In this paper, we pay particular attention to the problem of private empirical risk minimization (ERM) as an instantiation for our algorithms. Private ERM has been previously studied by [CMS11, KST12, BST14]; we compare our accuracy bounds in the dynamic setting to their static bounds in Table 1. (To get the static bounds, we use Appendix D of [BST14], which converts bounds on expected excess empirical risk to high probability bounds.)
Private Adaptive Analysis of a Static Database. If we wish to answer multiple queries on the same database by independently perturbing each answer, then the noise added to each answer must scale linearly with the number of queries to maintain privacy, meaning only queries can be answered with meaningful privacy and accuracy. If the queries are known in advance, however, [BLR08] showed how to answer exponentially many queries relative to the database size for fixed and . Later, Private Multiplicative Weights (PMW) [HR10] achieved a similar result in the interactive setting, where the analyst can adaptively decide which queries to ask based on previous outputs, with accuracy guarantees close to the sample error. A recent line of work [DFH15, CLN16, BNS16] showed deep connections between differential privacy and adaptive data analysis of a static database. Our results would allow analysts to apply these tools on dynamically growing databases.
Private Non-Adaptive Analysis of a Dynamic Database. Differential privacy for growing databases has been studied for a limited class of problems. Both [DNPR10] and [CSS11] adapted the notion of differential privacy to streaming environments in a setting where each entry in the database is a single bit, and bits arrive one per unit time. [DNPR10] and [CSS11] design differentially private algorithms for an analyst to maintain an approximately accurate count of the number of 1-bits seen thus far in the stream. This technique was later extended by [ST13] to maintain private sums of real vectors arriving online in a stream. We note that both of these settings correspond to only a single query repeatedly asked on a dynamic database, precluding meaningful adaptive analysis. In contrast, we consider adaptive analysis of dynamically growing databases, allowing the analyst exponentially many predicate queries to choose from as the database grows.
2 Preliminaries
All algorithms in this paper take as input databases over some fixed data universe of finite size . Our algorithms and analyses represent a finite database equivalently as a fractional histogram , where is the fraction of the database of type . When we say a database has size , this means that for each there exists some such that .
If an algorithm operates over a single fixed database, we refer to this as the static setting. For the dynamic setting, we define a database stream to be a sequence of databases starting with a database of size at time and increasing by one data entry per time step so that always denotes both a time and the size of the database at that time. Our dynamic algorithms also take a parameter , which denotes the starting size of the database.
We consider algorithms that answer real-valued queries with particular focus on linear queries. A linear query assigns a weight to each entry depending on its type and averages these weights over the database. We can interpret a linear query as a vector and write the answer to the query on database as , , or , depending on context. For viewed as a vector, denotes the th entry. We note that an important special case of linear queries are counting queries, which calculate the proportion of entries in a database satisfying some boolean predicate over .
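As a concrete illustration of these conventions, the sketch below builds the fractional histogram of a small database and answers a counting query as an inner product. The function names, the toy universe, and the restriction of query weights to [0, 1] are assumptions made for the example.

```python
from collections import Counter

def to_histogram(database, universe):
    """Fractional histogram: entry i is the fraction of the database of type universe[i]."""
    counts = Counter(database)
    n = len(database)
    return [counts[u] / n for u in universe]

def answer_linear_query(query, histogram):
    """A linear query is a weight vector; its answer is the inner product with the histogram."""
    return sum(q * h for q, h in zip(query, histogram))

universe = ["a", "b", "c"]
database = ["a", "b", "a", "c"]            # a database of size 4
x = to_histogram(database, universe)

# Counting query for the predicate "entry is of type a or c":
# weight 1 on types that satisfy the predicate, weight 0 otherwise.
is_a_or_c = [1.0, 0.0, 1.0]
answer = answer_linear_query(is_a_or_c, x)
```

Every histogram entry here is a multiple of 1/4, matching the description of a database of size 4.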
Many of the algorithms we study allow queries to be chosen adaptively, i.e., the algorithm accepts a stream of queries where the choice of can depend on the previous queries and answers. For the dynamic setting, we doubly index a stream of queries as so that denotes the size of the database at the time is received, and indexes the queries received when the database is size .
The algorithms studied produce outputs of various forms. To evaluate accuracy, we assume that an output of an algorithm for query class (possibly specified by an adaptively chosen query stream) can be interpreted as a function over , i.e., we write to denote the answer to based on the mechanism’s output. We seek to develop mechanisms that are accurate in the following sense.
Definition 1 (Accuracy in the static setting).
For , an algorithm is -accurate for query class if for any input database , the algorithm outputs such that for all with probability at least .
In the dynamic setting, accuracy must be with respect to the current database, and the bounds may be parametrized by time.
Definition 2 (Accuracy in the dynamic setting).
For and , an algorithm is -accurate for query stream if for any input database stream , the algorithm outputs such that for all with probability at least .
2.1 Differential Privacy and Composition Theorems
Differential privacy in the static setting requires that an algorithm produce similar outputs on neighboring databases , which differ by a single entry. In the dynamic setting, differential privacy requires similar outputs on neighboring database streams that satisfy that for some , for and for . (This definition is equivalent to the definition of neighboring streams in [CPWV16].) In the definition below, a pair of neighboring inputs refers to a pair of neighboring databases in the static setting or a pair of neighboring database streams in the dynamic setting.
Definition 3 (Differential privacy [DMNS06]).
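For $\epsilon, \delta \geq 0$, an algorithm $\mathcal{M}$ is $(\epsilon, \delta)$-differentially private if for any pair of neighboring inputs $x, x'$ and any subset $S \subseteq \mathrm{Range}(\mathcal{M})$,
$$\Pr[\mathcal{M}(x) \in S] \leq e^{\epsilon} \Pr[\mathcal{M}(x') \in S] + \delta.$$
When $\delta = 0$, we will say that $\mathcal{M}$ is $\epsilon$-differentially private.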
We note that in the dynamic setting, an element in is an entire (potentially infinite) transcript of outputs that may be produced by .
Differential privacy is typically achieved by adding random noise that scales with the sensitivity of the computation being performed. The sensitivity of any real-valued query is the maximum change in the query’s answer due to the change of a single entry in the database, denoted . We note that a linear query on a database of size has sensitivity .
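A minimal sketch of noise calibrated to sensitivity, using the standard Laplace mechanism of [DMNS06] on a single linear query. It assumes query weights in [0, 1], so that replacing one of the n entries changes the answer by at most 1/n; the function names and the toy data are hypothetical, and the sampler is for illustration only.

```python
import random

def linear_query(weights, database):
    """Average of the per-type weights over the database entries."""
    return sum(weights[e] for e in database) / len(database)

def laplace_noise(scale, rng):
    """The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_linear_query(weights, database, epsilon, rng):
    """Laplace mechanism: add noise with scale sensitivity / epsilon."""
    sensitivity = 1.0 / len(database)      # assumes weights lie in [0, 1]
    return linear_query(weights, database) + laplace_noise(sensitivity / epsilon, rng)

weights = {"a": 1.0, "b": 0.0, "c": 0.3}
database = ["a", "b", "a", "c", "b"]
neighbor = ["a", "b", "a", "c", "a"]       # differs from `database` in one entry

gap = abs(linear_query(weights, database) - linear_query(weights, neighbor))
noisy = private_linear_query(weights, database, epsilon=0.5, rng=random.Random(0))
```

The noise scale shrinks as the database grows, which is the property the later sections exploit when trading database size against per-run privacy parameters.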
The following composition theorems quantify how the privacy guarantee degrades as additional computations are performed on a database.
Theorem 3 (Basic composition [DMNS06]).
Let be an -differentially private algorithm for all . Then the composition defined as is -differentially private for .
Theorem 3 is useful to combine many differentially private algorithms to still achieve -differential privacy. Assuming the privacy loss in each mechanism is the same, the privacy loss from composing mechanisms scales with . There is an advanced composition theorem due to [DRV10] that improves the privacy loss to roughly by relaxing from -differential privacy to -differential privacy. However, advanced composition does not extend cleanly to the case where each has different . Instead we use a composition theorem based on concentrated differential privacy (CDP) of [BS16]. This gives us the flexibility to compose differentially private mechanisms with different to achieve -differential privacy, where scales comparably to the bound of advanced composition.
Theorem 4 (CDP composition, Corollary of [BS16]).
Let be a -differentially private algorithm for all . Then the composition of all is -differentially private for . In particular, for and , we have .
The statement follows from the three following propositions in [BS16]:
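A mechanism that is $\epsilon$-DP is $\frac{1}{2}\epsilon^2$-zCDP.
Composition of a $\rho_1$-zCDP mechanism and a $\rho_2$-zCDP mechanism is a $(\rho_1 + \rho_2)$-zCDP mechanism.
A $\rho$-zCDP mechanism is $(\rho + 2\sqrt{\rho \log(1/\delta)}, \delta)$-DP for any $\delta > 0$.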
Theorem 4 shows that composing -differentially private algorithms results in -differential privacy, where the privacy scales with the -norm of the vector and .
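The two composition routes can be compared numerically. The sketch below assumes the standard forms of the three [BS16] propositions: an ε-DP mechanism is (ε²/2)-zCDP, zCDP parameters add under composition, and a ρ-zCDP mechanism is (ρ + 2√(ρ log(1/δ)), δ)-DP. Chaining them gives a total of ½‖ε‖₂² + √(2 log(1/δ))·‖ε‖₂, which is at most 2‖ε‖₂√(log(1/δ)) whenever δ ≤ 1/e and ‖ε‖₂ ≤ 1. The function names and the example budgets are hypothetical.

```python
import math

def basic_composition(epsilons):
    """Basic composition: the per-mechanism privacy parameters add up."""
    return sum(epsilons)

def dp_to_zcdp(epsilon):
    """Assumed conversion: an epsilon-DP mechanism is (epsilon^2 / 2)-zCDP."""
    return 0.5 * epsilon ** 2

def zcdp_to_dp(rho, delta):
    """Assumed conversion: rho-zCDP implies (rho + 2*sqrt(rho*log(1/delta)), delta)-DP."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

def cdp_composition(epsilons, delta):
    """Convert each mechanism to zCDP, add the zCDP parameters, convert back to DP."""
    rho = sum(dp_to_zcdp(e) for e in epsilons)
    return zcdp_to_dp(rho, delta)

epsilons = [0.01] * 400                    # 400 runs, each 0.01-differentially private
delta = 1e-6
l2_norm = math.sqrt(sum(e ** 2 for e in epsilons))

basic = basic_composition(epsilons)        # grows with the 1-norm of the budgets
cdp = cdp_composition(epsilons, delta)     # grows with the 2-norm of the budgets
simple_bound = 2.0 * l2_norm * math.sqrt(math.log(1.0 / delta))
```

Because the second route depends only on the 2-norm, it applies unchanged when the per-run parameters differ, for example when they shrink geometrically across runs.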
2.2 Empirical Risk Minimization
Empirical risk minimization (ERM) is one of the most fundamental tasks in machine learning. The task is to find a good classifier from a set of classifiers , given a database of size sampled from some distribution over and loss function . The loss of a classifier on a finite database with respect to some is defined as . Common choices for include loss, hinge loss, and squared loss.
We seek to find a with small excess empirical risk, defined as,
In convex ERM, we assume that is convex for all and that is a convex set. We will also assume that . Convex ERM is convenient because finding a suitable reduces to a convex optimization problem, for which there exist many fast algorithms. Some examples of ERM include finding a -dimensional median and SVM.
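To make the quantity concrete, the sketch below computes excess empirical risk for the one-dimensional median problem, where the loss is the absolute deviation and the empirical risk minimizer is a median of the data. It takes excess empirical risk to be the empirical risk of a candidate minus the smallest empirical risk over the constraint set; the function names, the toy data, and the finite grid standing in for the convex set are assumptions of the example.

```python
def empirical_risk(theta, data, loss):
    """Average loss of classifier theta over the database."""
    return sum(loss(theta, x) for x in data) / len(data)

def excess_empirical_risk(theta, data, loss, candidates):
    """Empirical risk of theta minus the best empirical risk among the candidates."""
    best = min(empirical_risk(c, data, loss) for c in candidates)
    return empirical_risk(theta, data, loss) - best

def absolute_loss(theta, x):
    """Convex and 1-Lipschitz in theta; minimized on average by a median."""
    return abs(theta - x)

data = [0.0, 1.0, 4.0]
grid = [i / 10 for i in range(41)]         # grid over [0, 4] standing in for the convex set

excess_at_median = excess_empirical_risk(1.0, data, absolute_loss, grid)
excess_at_two = excess_empirical_risk(2.0, data, absolute_loss, grid)
```

The median attains zero excess empirical risk on this database, while the candidate 2.0 pays an extra average loss of 1/3.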
ERM is useful due to its connections to the true risk, also known as the generalization error, defined as . That is, the loss function will be low in expectation on a new data point sampled from . We can also define the excess risk of a classifier :
ERM finds classifiers with low excess empirical risk, which in turn often have low excess risk. The following theorem relates the two. For completeness, we first give some definitions relating to convex empirical risk minimization. A convex body is a set such that for all and all , . A vector is a subgradient of a function at if for all , . A function is -Lipschitz if for all pairs , . is -strongly convex on if for all and all subgradients at and all , we have . is -smooth on if for all , for all subgradients at and for all , we have . We denote the diameter of a convex set by .
Theorem 5 ([SSSSS09]).
For -Lipschitz and -strongly convex loss functions, with probability at least over the randomness of sampling the data set , the following holds:
Moreover, we can generalize this result to any convex and Lipschitz loss function by defining a regularized version of , called , such that . Then is -Lipschitz and -strongly convex. Also note that:
Thus, ERM finds classifiers with low true risk in these settings. The following result for differentially private static ERM is due to [BST14] and provides a baseline for our work in the dynamic setting.
Theorem 6 (Static ERM [BST14]).
There exists an algorithm ERM for that is -differentially private and -accurate for static ERM as long as , is 1-Lipschitz, and for sufficiently large constant ,
2.3 SmallDB
The SmallDB algorithm [BLR08] is a differentially private algorithm for generating synthetic databases. For any input database of size , class of linear queries, and accuracy parameter , the algorithm samples a database of size with exponential bias towards databases that closely approximate on all the queries in . The main strength of SmallDB is its ability to accurately answer exponentially many linear queries while still preserving privacy, captured in the following guarantee.
Theorem 7 (Static SmallDB [BLR08]).
The algorithm SmallDB() is -differentially private, and it is -accurate for linear query class of size as long as for sufficiently large constant ,
This bound on shows that for a fixed accuracy goal, the privacy parameter can decrease proportionally as the size of the input database increases.
2.4 Private Multiplicative Weights
The static private multiplicative weights (PMW) algorithm [HR10] answers an adaptive stream of linear queries while maintaining a public histogram , which reflects the current estimate of the static database given all previously answered queries. Critical to the performance of the algorithm is that it uses the public histogram to categorize incoming queries as either easy or hard, and it updates the histogram after hard queries in a way that moves it closer to a correct answer on that query. The number of hard queries is bounded using a potential argument, where the potential function is defined as the relative entropy between the database and the public histogram, i.e., . This quantity is initially bounded, it decreases by a substantial amount after every hard query, and it never increases. The following guarantee illustrates that this technique allows for non-trivial accuracy for exponentially many adaptively chosen linear queries (the bounds cited here are from the updated version at http://mrtz.org/papers/HR10mult.pdf).
Theorem 8 (Static PMW [HR10]).
The algorithm PMW() is -differentially private, and it is -accurate for adaptively chosen linear queries as long as for sufficiently large constant
This result is nearly tight in that any -differentially private algorithm that answers adaptively chosen linear queries on a database of size must have error [HR10]. PMW runs in time linear in the data universe size . If the incoming data entries are drawn from a distribution that satisfies a mild smoothness condition, a compact representation of the data universe can significantly reduce the runtime [HR10]. The same idea applies to our modification of PMW for the dynamic setting presented in Section 4, but we only present the inefficient and fully general algorithm.
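The histogram update and the potential argument described in this section can be sketched without any privacy noise. The code below is a simplified, non-private illustration and not the PMW algorithm of [HR10]: it uses the exact query answer where PMW would use a noisy one, and the step size `eta` (set to half the observed error) and all function names are assumptions. With query weights in [0, 1], a standard calculation shows that this update lowers the relative entropy by at least a quarter of the squared error.

```python
import math

def relative_entropy(x, y):
    """RE(x || y) = sum_i x_i * log(x_i / y_i), with the convention 0 * log 0 = 0."""
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y) if xi > 0)

def answer(query, histogram):
    """Answer of a linear query on a histogram."""
    return sum(q * h for q, h in zip(query, histogram))

def mw_update(y, query, true_answer, eta):
    """Noise-free multiplicative weights step on a hard query.
    If the histogram overestimates, down-weight types with large query weight;
    otherwise down-weight types with small query weight. Then renormalize."""
    if answer(query, y) > true_answer:
        penalty = list(query)
    else:
        penalty = [1.0 - q for q in query]
    weights = [yi * math.exp(-eta * p) for yi, p in zip(y, penalty)]
    total = sum(weights)
    return [w / total for w in weights]

x = [0.7, 0.1, 0.1, 0.1]                   # true database histogram
y = [0.25, 0.25, 0.25, 0.25]               # public histogram, initially uniform
query = [1.0, 0.0, 0.0, 0.0]               # counting query on the first type

error = abs(answer(query, y) - answer(query, x))
y_next = mw_update(y, query, answer(query, x), eta=error / 2.0)
potential_drop = relative_entropy(x, y) - relative_entropy(x, y_next)
```

After the step, the histogram's answer to the query moves toward the true answer and the potential falls, which is the mechanism that limits the number of hard queries in the static setting.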
3 Extending Accuracy Guarantees to Growing Databases
In this section, we give two schemes for answering a stream of queries on a growing database, given black box access to a differentially private algorithm for the static setting. Our results extend the privacy and accuracy guarantees of these static algorithms to the dynamic setting, even when data growth is unbounded. We also instantiate our results with important mechanisms for machine learning that are private in the static setting.
In Section 3.1, we provide an algorithm BBScheduler for scheduling repeated runs of a static algorithm. BBScheduler is differentially private and provides -accurate answers to all queries, for that does not change as the database grows or as more queries are asked. In Section 3.2, we provide a second algorithm BBImprover that allows the accuracy guarantee to improve as more data accumulate. This result is well-suited for problems where data points are sampled from a distribution, where one would expect the accuracy guarantees of static analysis to improve with the size of the sample. This algorithm is differentially private and -accurate, where is diminishing inverse polynomially in (i.e., approaching perfect accuracy as the database grows large).
For ease of presentation, we restrict our results to accuracy of real-valued queries, but the algorithms we propose could be applied to settings with more general notions of accuracy or to settings where the black box algorithm itself can change across time steps, adding to the adaptivity of this scheme.
3.1 Fixed Accuracy as Data Accumulate
In this section, we give results for using a private and accurate algorithm for the static setting as a black box to solve the analogous problem in the dynamic setting. Our general purpose algorithm BBScheduler treats a static algorithm as a black box endowed with privacy and accuracy guarantees, and it reruns the black box whenever the database grows by a small multiplicative factor. For concreteness, we first show in Section 3.1.1 how our results apply to the case of the well-known SmallDB algorithm, described in Section 2.3. Then in Section 3.1.2, we present the more general algorithm.
3.1.1 Application: SmallDB for Growing Databases
Before presenting our result in full generality, we instantiate it on SmallDB for concreteness, and show how to extend SmallDB to the dynamic setting. Recall from Section 2.3 that the static SmallDB algorithm takes in a database , a class of linear queries , a privacy parameter , and accuracy parameters , . The algorithm is -differentially private and outputs a smaller database of size , from which all queries in can be answered with -accuracy.
In the dynamic setting, we receive a database stream , a stream of queries from some class of linear queries , parameters , and starting database size . We still require -differential privacy and -accuracy on the entire stream of queries, for that remains fixed as the database grows.
We design the SmallDBG algorithm, which works by running SmallDB at times , where for some chosen by the algorithm. (For simplicity, we will assume that is integral for all . We can replace with and achieve the same bounds up to a small sub-constant additive factor.) We will label the time interval from to as the epoch. At the start of the epoch, we call SmallDB on the current database with privacy parameter , and output a synthetic database that will be used to answer queries received during epoch . (Note that SmallDBG will still give similar guarantees if the query class changes over time, provided that the black box SmallDB at time uses the correct query class for times to . We could think of this as SmallDBG receiving a SmallDB() as its black box in epoch .) SmallDBG provides the following guarantee:
SmallDBG() is -differentially private and can answer all queries in query stream from query class of size with -accuracy for sufficiently large constant and
(With a more careful analysis, one can show that the numerator in this accuracy bound can be taken to be to match the form of the bound in Theorem 7.)
Note that there is no bound on the number of queries or on the database growth. The algorithm can provide answers to an arbitrary number of linear queries at any time.
There are two key technical properties that allow this result to hold. First, each data point added to a database of size can only change a linear query by roughly . Thus, using synthetic database from time for queries before time will incur extra additive error of at most . Second, since the ’s grow by a multiplicative factor each time, the epochs become exponentially far apart and the total privacy loss (due to composition of multiple calls of SmallDB) is not too large.
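The first property can be checked numerically. The sketch below assumes query weights in [0, 1] and uses a bound derived directly from the definition of a linear query rather than quoted from the paper: if a database of size t grows to size τ, the answer to any linear query moves by at most (τ - t)/τ, which is at most γ/(1 + γ) when τ ≤ (1 + γ)t. The function names and the random stream are hypothetical.

```python
import random

def linear_query(weights, database):
    """Average of the per-type weights over the database entries."""
    return sum(weights[e] for e in database) / len(database)

def count_violations(weights, stream, t, t_max):
    """Count the times tau in [t, t_max] at which the answer on the first tau
    entries drifts from the answer on the first t entries by more than (tau - t) / tau."""
    stale = linear_query(weights, stream[:t])
    violations = 0
    for tau in range(t, t_max + 1):
        drift = abs(linear_query(weights, stream[:tau]) - stale)
        if drift > (tau - t) / tau + 1e-12:
            violations += 1
    return violations

rng = random.Random(0)
universe_size = 5
weights = [rng.random() for _ in range(universe_size)]           # weights in [0, 1]
stream = [rng.randrange(universe_size) for _ in range(300)]      # entries arrive one per step

# Answers computed at time 200 are reused until time 300, i.e. gamma = 0.5.
violations = count_violations(weights, stream, t=200, t_max=300)
worst_case_drift = (300 - 200) / 300                             # equals gamma / (1 + gamma)
```

The staleness error therefore depends only on the growth factor within an epoch and not on the absolute database size, which is why multiplicative epoch lengths keep it fixed.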
3.1.2 A General Black Box Scheduler
The results for SmallDBG are an instantiation of a more general result that extends the privacy and accuracy guarantees of any static algorithm to the dynamic setting. Our general purpose algorithm BBScheduler treats a static algorithm as a black box endowed with privacy and accuracy guarantees, and reruns the black box whenever the database grows by a factor of . Due to the generality of this approach, BBScheduler can be applied to any algorithm that satisfies -differential privacy and -accuracy, as specified in Definition 4.
Definition 4 (-black box).
An algorithm is a -black box for a class of linear queries if it is -differentially private and with probability it outputs such that for every when for some that is independent of .
The parameter is intended to capture dependence on domain-specific parameters that affect the accuracy guarantee. For example, SmallDB is a -black box for an arbitrary set of linear queries, and its output is a synthetic database of size .
Our generic algorithm BBScheduler (Algorithm 1) will run the black box at times for with that depends on and . The call will have parameters and , and will use to answer queries received during the epoch, from to .
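The scheduling pattern can be sketched as follows. This is an illustration of the epoch structure only, not Algorithm 1: the geometric budget split with a hypothetical `ratio` parameter is one simple way to make per-run privacy parameters shrink exponentially while summing to the total under basic composition, and it is not the paper's actual parameter setting. Epoch start times are rounded up to integers, and all function names are assumptions.

```python
import math

def epoch_start(i, n, gamma):
    """Time at which epoch i begins: the starting size n grown by (1 + gamma)^i, rounded up."""
    return math.ceil((1.0 + gamma) ** i * n)

def epoch_of(t, n, gamma):
    """Index of the epoch containing time t (requires t >= n)."""
    i = 0
    while epoch_start(i + 1, n, gamma) <= t:
        i += 1
    return i

def epoch_budget(i, epsilon, ratio):
    """Hypothetical geometric split: the budgets over all epochs sum to epsilon."""
    return epsilon * (1.0 - ratio) * ratio ** i

n, gamma, epsilon, ratio = 100, 0.5, 1.0, 0.5
starts = [epoch_start(i, n, gamma) for i in range(4)]
budgets = [epoch_budget(i, epsilon, ratio) for i in range(50)]

# A query arriving at time t is answered from the output of the black box
# that was run at the start of the epoch containing t.
epoch_for_query = epoch_of(224, n, gamma)
```

Because the epoch start times grow geometrically, only logarithmically many runs of the black box occur by any given time, which is what keeps the composed privacy loss manageable.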
We now state our main result for BBScheduler:
Let be a -black box for query class . Then for any database stream and stream of linear queries over , BBScheduler() is -differentially private for and ()-accurate for sufficiently large constant and
Note that this algorithm can provide two different types of accuracy bounds. If we desire -differential privacy, then the accuracy bounds are slightly weaker, while if we allow -differential privacy, we can get improved accuracy bounds at the cost of a small loss in privacy. The only differences are how the algorithm sets and . For a complete proof of Theorem 10, see Appendix A. We present a proof sketch below.
Proof sketch of Theorem 10.
BBScheduler inherits its privacy guarantee from the black box and the composition properties of differential privacy. When , we use Theorem 3 (Basic Composition). When , we use Theorem 4 (CDP Composition). These two cases require different settings of and for their respective composition theorems to yield the desired privacy guarantee.
To prove the accuracy of BBScheduler we require the following lemma, which bounds the additive error introduced by answering queries that arrive mid-epoch using the slightly outdated database from the end of the previous epoch.
For any linear query and databases and from a database stream , where for some ,
We combine this lemma with the algorithm’s choice of to show that with probability at least , all mid-epoch queries are answered -accurately with respect to the current database. The final step is to bound the overall failure probability of the algorithm. Taking a union bound over the failure probabilities in each epoch, we complete the proof by showing that .
3.2 Improving Accuracy as Data Accumulate
In the previous section, our accuracy bounds stayed fixed as the database size increased. However, in some applications it is more natural for accuracy bounds to improve as the database grows. For instance, in empirical risk minimization (defined in Section 2.2) the database can be thought of as a set of training examples. As the database grows, we expect to be able to find classifiers with shrinking empirical risk, which implies shrinking generalization error. More generally, when database entries are random samples from a distribution, one would expect accuracy of analysis to improve with more samples.
In this section, we extend our black box scheduler framework to allow for accuracy guarantees that improve as data accumulate. Accuracy improvements over BBScheduler are typically only seen once the database is sufficiently large. We first instantiate our result for empirical risk minimization in Section 3.2.1, and then present the general result in Section 3.2.2.
3.2.1 Application: Empirical Risk Minimization for Growing Databases
In the static setting, an algorithm for empirical risk minimization (ERM) takes in a database of size , and outputs a classifier from some set that minimizes a loss function on the sample data. Increasing the size of the training sample will improve accuracy of the classifier, as measured by excess empirical risk (Equation (2.1)). Given the importance of ERM, it is no surprise that a number of previous works have considered differentially private ERM in the static setting [CMS11, KST12, BST14].
For ERM in the dynamic setting, we want a classifier at every time that achieves low empirical risk on the current database, and we want the empirical risk of our classifiers to improve over time, as in the static case. Note that the dynamic variant of the problem is strictly harder because we must produce classifiers at every time step, rather than waiting for sufficiently many new samples to arrive. Releasing classifiers at every time step degrades privacy, and thus requires more noise to be added to preserve the same overall privacy guarantee. Nonetheless, we will compare our private growing algorithm, which simultaneously provides accuracy bounds for every time step from to infinity, to private static algorithms, which are only run once.
In ERMG, our algorithm for ERM in the dynamic setting, the sole query of interest is the loss function evaluated on the current database. At each time , ERMG receives a single query , where evaluated on the database is . The black box outputs , which is a classifier from that can be used to evaluate the single query . Our accuracy guarantee at time is the difference between and :
This expression is identical to the excess empirical risk defined in Equation (2.1). Thus accurate answers to queries are equivalent to minimizing empirical risk. Our accuracy bounds are stated in Theorem 12.
Theorem 12.
Let , and be a convex loss function that is 1-Lipschitz over some set with . Then for any stream of databases with points in , ERMG() is -differentially private and with probability at least produces classifiers for all that for sufficiently large constant have excess empirical risk bounded by
If is also -strongly convex,
The results in Theorem 12 all come from instantiating (the more general) Theorem 13 stated in the next section, and the proof is in Section A.2. We use the static $(\epsilon, 0)$-differentially private algorithms of [BST14] as black boxes. The differing assumptions on the loss function allow us to use different black boxes with different input parameters in each case. We compare our growing bounds to these static bounds in Table 1. (To get the static bounds, we use Appendix D of [BST14], which converts bounds on expected excess empirical risk to high probability bounds.) Since ERMG provides $(\epsilon, \delta)$-differential privacy, we also include static $(\epsilon, \delta)$-differential privacy bounds for comparison in Table 1. The static bounds are optimal in and up to log factors.
| Assumptions | Static $(\epsilon, 0)$-DP [BST14] | Static $(\epsilon, \delta)$-DP [BST14] | Dynamic $(\epsilon, \delta)$-DP (our results) |
|… and -strongly convex (implies )|
Note that the bounds we get for the growing setting have the same dependence on , and and better dependence on . The dependence on in our bound is roughly the square root of that in the static bounds. Compared to the static -differential privacy bounds, our dependence on is the same, while the dependence is squared relative to the static -differential privacy bounds.
Given that the growing setting is strictly harder than the static setting, it is somewhat surprising that we have no loss in most of the parameters, and only minimal loss in the size of the database . Thus, for ERM, performance in the static setting largely carries over to the growing setting.
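The accuracy measure used for ERM in the growing setting, excess empirical risk on the current database, can be illustrated with a small non-private sketch. The absolute loss, the one-dimensional constraint set $[0,1]$, and the helper names are all assumptions made for this example; the absolute loss is convex and 1-Lipschitz, and the median of the sample is one of its empirical minimizers.

```python
from statistics import median


def empirical_risk(theta, data):
    """Average absolute loss |theta - z| over the current database."""
    return sum(abs(theta - z) for z in data) / len(data)


def excess_empirical_risk(theta_hat, data):
    """Risk of theta_hat minus the smallest achievable risk on `data`."""
    theta_star = median(data)  # a minimizer of the average absolute loss
    return empirical_risk(theta_hat, data) - empirical_risk(theta_star, data)


# Hypothetical stream of points in [0, 1]; the database at time t is stream[:t].
stream = [0.2, 0.9, 0.4, 0.5, 0.1, 0.6]
risks = [excess_empirical_risk(0.5, stream[:t]) for t in range(1, len(stream) + 1)]
```

Here a fixed candidate classifier is scored against each prefix of the stream. A dynamic algorithm would instead release a fresh classifier at every time step and be judged by the same quantity.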
3.2.2 A General Black Box Scheduler for Improving Accuracy
In this section we describe the general BBImprover algorithm, which achieves accuracy guarantees in the dynamic setting that improve as the database size grows. The algorithm takes in a private and accurate static black box , which it re-runs on the current database at every time step. We require the following more general definition of black box to state the privacy and accuracy guarantees of BBImprover.
Definition 5 (Definition of -black box).
An algorithm is a -black box for a class of linear queries if it is -differentially private and with probability it outputs some such that for every when for some that is independent of .
The algorithm BBImprover (Algorithm 2) will run the black box after each new data point arrives, starting at time , using time-dependent parameters . The output will be used to answer all queries that arrive at time .
The following theorem is our main result for BBImprover, which states that the algorithm is differentially private and -accurate for that decreases inverse polynomially in . The complete proof is given in Appendix A.
Theorem 13.
Let and let be a -black box for query class . Then for any database stream and stream of linear queries over , BBImprover() is -differentially private for and -accurate for sufficiently large constant and
The free parameter in Theorem 13 can be any positive constant, and should be set to an arbitrarily small constant for the algorithm to achieve the best asymptotic performance.
BBImprover does not incur accuracy loss from ignoring new data points mid-epoch as in BBScheduler because it runs at every time step. However, this also means that privacy loss will accumulate much faster than in BBScheduler because more computations are being composed. To combat this and achieve overall privacy loss , each run of will have increasingly strict (i.e., smaller) privacy parameter . The additional noise needed to preserve privacy will overpower the improvements in accuracy until the database grows sufficiently large, when the accuracy of BBImprover will surpass the comparable fixed accuracy guarantee of BBScheduler. For any , the guarantees of BBImprover are stronger when . This suggests that an analyst’s choice of algorithm should depend on her starting database size and expectations of data growth.
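The tension between rerunning at every step and keeping the total privacy loss bounded can be sketched numerically. The schedule below, with per-step parameter proportional to $t^{-(1/2+c)}$, is an assumption made for illustration and is not claimed to be the schedule BBImprover uses. It relies on two standard facts: an $\epsilon$-differentially private mechanism satisfies $(\epsilon^2/2)$-zCDP, and zCDP parameters add under composition. The example black box is a Laplace mechanism for the mean of $t$ values in $[0,1]$, which has sensitivity $1/t$.

```python
def step_epsilon(t, base_epsilon, c=0.1):
    """Hypothetical per-step privacy parameter, shrinking like t^-(1/2 + c)."""
    return base_epsilon * t ** -(0.5 + c)


def laplace_scale_for_mean(t, eps_t):
    """Laplace scale for a mean of t values in [0, 1] (sensitivity 1/t)."""
    return 1.0 / (t * eps_t)


def schedule(horizon, base_epsilon, c=0.1):
    """Return (t, eps_t, noise_scale_t, cumulative_rho) for t = 1..horizon."""
    rows = []
    rho = 0.0
    for t in range(1, horizon + 1):
        eps_t = step_epsilon(t, base_epsilon, c)
        rho += eps_t**2 / 2.0  # zCDP cost of one eps_t-DP run; costs add up
        rows.append((t, eps_t, laplace_scale_for_mean(t, eps_t), rho))
    return rows


rows = schedule(horizon=1000, base_epsilon=0.5, c=0.1)
```

With this choice the noise scale decays like $t^{-(1/2 - c)}$ while the cumulative zCDP cost stays below $\tfrac{1}{2}\,\texttt{base\_epsilon}^2\,(1 + 1/(2c))$ for every horizon. A smaller $c$ gives faster decay of the noise at the price of a larger constant.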
4 Private Multiplicative Weights for Growing Databases
In this section, we show how to modify the private multiplicative weights (PMW) algorithm for adaptive linear queries [HR10] to handle continuous data growth. The first black box process BBScheduler in the previous section shows that any algorithm can be rerun with appropriate privacy parameters at appropriate points of data growth with minimal loss of accuracy with respect to the intra-epoch data. However, in some settings it may be undesirable to ignore new data for long periods of time, even if the overall accuracy loss is small. Although BBImprover runs the black box algorithm at every step for eventual tighter accuracy bounds, these bounds are inferior until the database grows substantially. We now show how to open the black box and apply these scheduling techniques with a modification of PMW that considers all available data when a query arrives, achieving tight bounds on accuracy as soon as analysis begins and continuing through infinite data growth.
The static PMW algorithm answers an adaptive stream of queries while maintaining a public histogram reflecting the current estimate of the database given all previously answered queries. Critical to the performance of the algorithm is that it categorizes incoming queries as either easy or hard, suffering significant privacy loss only for the hard queries. Hardness is determined with respect to the public histogram: upon receipt of a query for which the histogram provides a significantly different answer than the true database, PMW classifies this as a hard query, and it updates the histogram in a way that moves it closer to a correct answer on that query. The number of hard queries is bounded using a potential argument. Potential is defined as the relative entropy between the database and the public histogram. This quantity is initially bounded, decreases by a substantial amount after every hard query, and never increases.
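The update on a hard query and the potential it drives down can be sketched as follows. This is a simplified illustration, not the algorithm's exact specification: the sign convention, the learning rate `eta`, and the helper names are assumptions, and the noise that PMW adds to the answer is omitted so that the effect of the update is easy to see.

```python
import math


def answer(query, hist):
    """Evaluate a linear query (one value in [0, 1] per universe element)."""
    return sum(q * h for q, h in zip(query, hist))


def mw_update(hist, query, noisy_answer, eta):
    """Multiplicative weights step that moves hist toward noisy_answer."""
    sign = 1.0 if noisy_answer > answer(query, hist) else -1.0
    weights = [h * math.exp(sign * eta * q) for h, q in zip(hist, query)]
    total = sum(weights)
    return [w / total for w in weights]


def relative_entropy(p, q):
    """KL divergence of p from q; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


database = [0.7, 0.1, 0.1, 0.1]   # true normalized histogram (hypothetical)
hist = [0.25, 0.25, 0.25, 0.25]   # public histogram, initially uniform
query = [1.0, 0.0, 0.0, 0.0]      # a query the public histogram answers badly

before = relative_entropy(database, hist)
updated = mw_update(hist, query, answer(query, database), eta=0.2)
after = relative_entropy(database, updated)
```

In this example the relative entropy drops after the update and the public histogram's answer to the query moves toward the true answer, which is the behavior the potential argument relies on.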
If we run static PMW on a growing database, the previous potential argument fails because the relative entropy between the database and the public histogram can increase as new data arrive. In the worst case, PMW can learn the database with high accuracy (using many hard queries), and then adversarial data growth can change the composition of the database dramatically, increasing the number of possible hard queries well beyond the bound for the static case. Instead, we modify PMW so that the public histogram updates not only in response to hard queries but also in response to new data arrivals. By treating the new data as coming from a uniform distribution, these latter updates incur no additional privacy loss, and they mitigate the relative entropy increase due to new data. In fact, this modification allows us to suffer only constant loss in accuracy per query relative to the static setting, while maintaining this accuracy through unbounded data growth and accumulating additional query budget during growth.
4.1 $(\epsilon, 0)$-Differentially Private PMWG
Our formal algorithm for PMW for growing databases (PMWG) is given as Algorithm 3 below. We give an overview here to motivate our main results. The algorithm takes as input a database stream and an adaptively chosen query stream. It also accepts privacy and accuracy parameters. In this section we restrict to the case where $\delta = 0$; in Section 4.2, we allow $\delta > 0$.
The algorithm maintains a fractional histogram over , where denotes the histogram after the th query at time has been processed. This histogram is initialized to uniform, i.e., for all . As with static PMW, when a query is deemed hard, our algorithm performs a multiplicative weights update of with learning rate . As an extension of the static case, we also update the weights of when a new data entry arrives to reflect a data-independent prior belief that data arrive from a uniform distribution. That is, for all ,
It is important to note that a multiplicative weights update depends only on the noisy answer to a hard query as in the static case, and the uniform update only depends on the knowledge that a new entry arrived, so this histogram can be thought of as public.
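One way to realize a data-independent update on arrival is to mix the current public histogram with the uniform distribution, giving the new entry weight $1/t$. The sketch below assumes this convex-combination form and the name `arrival_update`; both are illustrative assumptions. Note that the function reads only the public histogram and the time $t$, never the new entry itself.

```python
def arrival_update(hist, t):
    """Mix the public histogram with uniform when the t-th entry arrives.

    Assumed form: new = ((t - 1) / t) * old + (1 / t) * uniform.
    The update uses no information about the value of the new entry.
    """
    n = len(hist)
    return [(t - 1) / t * h + (1.0 / t) * (1.0 / n) for h in hist]


hist = [0.7, 0.1, 0.1, 0.1]           # public histogram before the arrival
new_hist = arrival_update(hist, t=5)  # the 5th data point has just arrived
```

Under this form the result is still a probability distribution, and as $t$ grows each arrival perturbs the histogram less, matching the shrinking influence of a single entry on a database of size $t$.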
As in static PMW, we determine hardness using a Numeric Sparse subroutine. We specify a hardness threshold of , and we additionally specify a function that varies with time and determines how much noise to add to the hardness quantities. Our most general result for $(\epsilon, 0)$-privacy (Theorem 22 in Appendix B.1) considers other noise functions, but for the results stated here, we let for appropriate constant . A query’s hardness is determined by the subroutines after adding Laplace noise with parameter . We present and analyze the required growing database modifications to Numeric Sparse and its subroutines Numeric Above Threshold and Above Threshold in Appendix C; these algorithms may be of independent interest for future work in the design of private algorithms for growing databases.
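A noisy hardness test in the spirit of Above Threshold can be sketched as follows. This is a simplification under stated assumptions: a single Laplace scale is used for both the threshold and the query gap, the scale is passed in by the caller rather than computed from a time-dependent noise function, and the names `laplace` and `is_hard` are hypothetical.

```python
import random


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def is_hard(true_answer, hist_answer, threshold, noise_scale, rng):
    """Compare the noisy gap between true and public answers to a noisy threshold."""
    noisy_gap = abs(true_answer - hist_answer) + laplace(noise_scale, rng)
    noisy_threshold = threshold + laplace(noise_scale, rng)
    return noisy_gap > noisy_threshold


rng = random.Random(0)
# With a time-varying noise function, the caller would pass its value at the
# current time t as noise_scale; here a fixed illustrative value is used.
verdict = is_hard(true_answer=0.7, hist_answer=0.25, threshold=0.3,
                  noise_scale=0.05, rng=rng)
```

Only queries classified as hard trigger a histogram update, so the amount of noise directly controls how often a query is misclassified.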
We now present our main result for PMWG, Theorem 14. We sketch its proof here and give the full proof in Appendix B.1. Whereas the accuracy results for static PMW are parametrized by the total allowed queries , our noise scaling means our algorithm can accommodate more and more queries as new data continue to arrive. Our accuracy result is with respect to a query stream respecting a query budget that increases at each time by a quantity increasing exponentially with . This budget is parametrized by time-independent , which is somewhat analogous to the total query budget in static PMW. This theorem tells us that PMWG can accommodate queries on the original database. Since degrades accuracy logarithmically, this means we can accurately answer exponentially many queries before any new data arrive. In particular, our accuracy bounds are tight with respect to the static setting, and we maintain this accuracy through unbounded data growth, subject to a generous query budget specified by the theorem’s bound on