Robustness to changes in the data and protecting the privacy of data are two of the main challenges faced by machine learning and have led to the design of online and differentially private learning algorithms. While offline PAC learnability is characterized by the finiteness of VC dimension, online and differentially private learnability are both characterized by the finiteness of the Littlestone dimension (PrivatePACLD; ben2009agnostic; bun2020equivalence). This latter characterization is often interpreted as an impossibility result for achieving robustness and privacy on worst-case instances, especially in classification where even simple hypothesis classes such as one-dimensional thresholds have constant VC dimension but infinite Littlestone dimension.
Impossibility results for worst-case adversaries do not invalidate the original goals of robust and private learning with respect to practically relevant hypothesis classes; rather, they indicate that a new model is required to provide rigorous guidance on the design of online and differentially private learning algorithms. In this work, we go beyond worst-case analysis and design online learning algorithms and differentially private learning algorithms as good as their offline and non-private PAC learning counterparts in a realistic semi-random model of data.
Inspired by smoothed analysis (ST04), we introduce frameworks for online and differentially private learning in which adversarially chosen inputs are perturbed slightly by nature (reflecting, e.g., measurement errors or uncertainty). Equivalently, we consider an adversary restricted to choose an input distribution that is not overly concentrated, with the realized input then drawn from the adversary’s chosen distribution. Our goal is to design algorithms with good expected regret and error bounds, where the expectation is over nature’s perturbations (and any random coin flips of the algorithm). Our positive results show, in a precise sense, that the known lower bounds for worst-case online and differentially private learnability are fundamentally brittle.
Let us first consider the standard online learning setup with an instance space $\mathcal{X}$ and a set $\mathcal{H}$ of binary hypotheses, each mapping $\mathcal{X}$ to a binary label set $\mathcal{Y}$. Online learning is played over $T$ time steps, where at each step $t$ the learner picks a prediction function from a distribution and the adaptive adversary chooses a pair $(x_t, y_t) \in \mathcal{X} \times \mathcal{Y}$. The regret of an algorithm is the difference between the number of mistakes the algorithm makes and that of the best fixed hypothesis in $\mathcal{H}$. The basic goal in online learning is to obtain a regret of $o(T)$. In comparison, in differential privacy the data set is specified ahead of time. Our goal here is to design a randomized mechanism that with high probability finds a nearly optimal hypothesis in $\mathcal{H}$ on this data set, while ensuring that the computation is differentially private. That is, changing a single element of the data set does not significantly alter the probability with which our mechanism selects an outcome. Similar to agnostic PAC learning, this can be done by ensuring that the error of each hypothesis on the data set (referred to as a query) is calculated accurately and privately.
We extend these two models to accommodate smoothed adversaries. We say that a distribution over instance-label pairs is $\sigma$-smooth if its density function over the instance domain is pointwise bounded by at most $1/\sigma$ times that of the uniform distribution. In the online learning setting this means that at step $t$, the adversary chooses an arbitrary $\sigma$-smooth distribution $\mathcal{D}_t$ from which $(x_t, y_t)$ is drawn. In the differential privacy setting, we work with a database for which the answers to the queries could have been produced by a $\sigma$-smooth distribution.
Why should smoothed analysis help in online learning? Consider the well-known lower bound for one-dimensional thresholds over $\mathcal{X} = [0,1]$, in which the learner may as well perform binary search and the adversary selects an instance within the uncertainty region of the learner that causes a mistake. While the learner’s uncertainty region is halved each time step, the worst-case adversary can use ever-more precision to force the learner to make mistakes indefinitely. On the other hand, a $\sigma$-smoothed adversary effectively has bounded precision. That is, once the width of the uncertainty region drops below $\sigma$, a smoothed adversary can no longer guarantee that the chosen instance lands in this region. Similarly for differential privacy, there is a $\sigma$-smooth distribution that produces the same answers to the queries. Such a distribution has no more than $\alpha$ probability over an interval of width $\sigma\alpha$. So one can focus on computing the errors of the hypotheses with discretized thresholds and learn a hypothesis of error at most $\alpha$. Analogous observations have been made in prior works (NIPS2011_4262, Cohen-Addad, Gupta_Roughgarden), although only for very specific settings (online learning of one-dimensional thresholds, one-dimensional piecewise constant functions, and parameterized greedy heuristics for the maximum weight independent set problem, respectively). Our work is the first to demonstrate the breadth of the settings in which fundamentally stronger learnability guarantees are possible for smoothed adversaries than for worst-case adversaries.
Our Results and Contributions.
Our main result concerns online learning with adaptive -smooth adversaries where can depend on the history of the play, including the earlier realizations of for . That is, and can be highly correlated. We show that regret against these powerful adversaries is bounded by , where is the bracketing number of with respect to the uniform distribution. (Footnote 1: Along the way, we also demonstrate a stronger regret bound for the simpler case of non-adaptive adversaries, for which each distribution is independent of the realized inputs in previous time steps.) Bracketing number is the size of an -cover of with the additional property that hypotheses in the cover are pointwise approximations of those in . We show that for many hypothesis classes, the bracketing number is nicely bounded as a function of the VC dimension. This leads to the regret bound of for commonly used hypothesis classes in machine learning, such as halfspaces, polynomial threshold functions, and polytopes. In comparison, these hypothesis classes have infinite Littlestone dimension and thus cannot be learned with regret in the worst case (ben2009agnostic).
From a technical perspective, we introduce a novel approach for bounding time-correlated non-independent stochastic processes over infinite hypothesis classes using the notion of bracketing number. Furthermore, we introduce systematic approaches, such as high-dimensional linear embeddings and -fold operations, for analyzing the bracketing number of complex hypothesis classes. We believe these techniques are of independent interest.
For differentially private learning, we obtain an error bound of ; the key point is that this bound is independent of the size of the domain and the size of the hypothesis class. We obtain these bounds by modifying two commonly used mechanisms in differential privacy, the Multiplicative Weight Exponential Mechanism of NIPS2012_4548 and the SmallDB algorithm of SmallDB. With worst-case adversaries, these algorithms achieve only error bounds of and , respectively. Our results also improve over those in DPMultWeights which concern a similar notion of smoothness and achieve an error bound of .
Other Related Works.
At a higher level, our work is related to several works on the intersection of machine learning and beyond the worst-case analysis of algorithms (e.g., (Dispersion; dekel2017online; kannan2018smoothed)) that are covered in more detail in Appendix A.
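Online Learning. We consider a measurable instance space $\mathcal{X}$ and the label set $\mathcal{Y} = \{+1, -1\}$. Let $\mathcal{H}$ be a hypothesis class on $\mathcal{X}$ with its VC dimension denoted by $d$. Let $\mathcal{U}$ be the uniform distribution over $\mathcal{X}$ with density function $u(\cdot)$. For a distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, let $p(\cdot)$ be the probability density function of its marginal over $\mathcal{X}$. We say that $\mathcal{D}$ is $\sigma$-smooth if for all $x \in \mathcal{X}$, $p(x) \le u(x)/\sigma$. For a labeled pair $s = (x, y)$ and a hypothesis $h \in \mathcal{H}$, $\mathrm{err}_s(h) = \mathbb{1}[h(x) \neq y]$ indicates whether $h$ makes a mistake on $s$.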
We consider the setting of online adversarial and (full-information) learning. In this setting, a learner and an adversary play a repeated game over time steps. In every time step the learner picks a hypothesis and adversary picks a -smoothed distribution from which a labeled pair such that is generated. The learner then incurs penalty of . We consider two types of adversaries. First (and the subject of our main results) is called an adaptive -smooth adversary. This adversary at every time step chooses based on the actions of the learner and, importantly, the realizations of the previous instances . We denote this adaptive random process by . A second and less powerful type of adversary is called a non-adaptive -smooth adversary. Such an adversary first chooses an unknown sequence of distributions such that is a -smooth distribution for all . Importantly, does not depend on realizations of adversary’s earlier actions or the learner’s actions . We denote this non-adaptive random process by . With a slight abuse of notation, we denote by and the sequence of (unlabeled) instances in and .
Our goal is to design an online algorithm such that expected regret against an adaptive adversary,
is sublinear in . We also consider the regret of an algorithm against a non-adaptive adversary defined similarly as above and denoted by .
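We also consider differential privacy. In this setting, a data set $S$ is a multiset of elements from domain $\mathcal{X}$. Two data sets $S$ and $S'$ are said to be adjacent if they differ in at most one element. A randomized algorithm $\mathcal{A}$ that takes as input a data set is $(\epsilon, \delta)$-differentially private if for all $\mathcal{R} \subseteq \mathrm{Range}(\mathcal{A})$ and for all adjacent data sets $S$ and $S'$, $\Pr[\mathcal{A}(S) \in \mathcal{R}] \le e^{\epsilon} \Pr[\mathcal{A}(S') \in \mathcal{R}] + \delta$. If $\delta = 0$, the algorithm is said to be purely $\epsilon$-differentially private.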
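To make the definition concrete, here is a minimal Python sketch of the standard Laplace mechanism applied to a single counting-style query (the fraction of ones in a binary data set). It is an illustration only: the function names `laplace_noise` and `private_mean`, the binary data, and the replace-one-element notion of adjacency used to derive the sensitivity are assumptions made for this example and are not taken from the text.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    # expovariate takes the rate (1 / mean), so each draw has mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(bits, epsilon, rng):
    """Release the fraction of 1s in `bits` with pure epsilon-differential privacy.

    Assumption: adjacent data sets differ by replacing one element, so the
    mean of n binary values changes by at most 1 / n (its sensitivity).
    """
    n = len(bits)
    true_value = sum(bits) / n
    sensitivity = 1.0 / n
    return true_value + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [1, 0, 1, 1, 0, 0, 1, 0]
    print(private_mean(data, epsilon=1.0, rng=rng))
```

Smaller values of `epsilon` give stronger privacy and proportionally larger noise; the noise also shrinks as the data set grows, since the sensitivity is 1/n.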
For differentially private learning, one considers a fixed class of queries . The learner’s goal is to evaluate these queries on a given data set . For ease of notation, we work with the empirical distribution corresponding to a data set . Then the learner’s goal is to approximately compute while preserving privacy. (Footnote 2: In differentially private learning, queries are the error function of hypotheses and take as input a pair .) We consider two common paradigms of differential privacy. The first, called query answering, involves designing a mechanism that outputs values for all such that with probability for every , . The second paradigm, called data release, involves designing a mechanism that outputs a synthetic distribution , such that with probability for all , . That is, the user can use to compute the value of any approximately.
Analogous to the definition of smoothness in online learning, we say that a distribution with density function is -smooth if for all . We also work with a weaker notion of smoothness of data sets. A data set is said to be -smooth with respect to a query set if there is a -smooth distribution such that for all , we have . The definition of -smoothness, which is also referred to as pseudo-smoothness by DPMultWeights, captures data sets that though might be concentrated on some elements, the query class is not capable of noticing their lack of smoothness.
Let be a hypothesis class and let be a distribution. is an -cover for under if for all , there is a such that . For any and , there is an -cover under such that (HAUSSLER1995217).
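We define a partial order $\preceq$ over functions such that $f_1 \preceq f_2$ if and only if for all $x \in \mathcal{X}$, we have $f_1(x) \le f_2(x)$. For a pair of functions $f_1, f_2$ such that $f_1 \preceq f_2$, a bracket $[f_1, f_2]$ is defined by $[f_1, f_2] = \{ f : f_1 \preceq f \preceq f_2 \}$. Given a measure $\mu$ over $\mathcal{X}$, a bracket $[f_1, f_2]$ is called an $\epsilon$-bracket if $\mu(\{ x : f_1(x) \neq f_2(x) \}) \le \epsilon$.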
Definition 2.1 (Bracketing Number).
Consider an instance space , measure over this space, and hypothesis class . A set of brackets is called an -bracketing of with respect to measure if all brackets in are -brackets with respect to and for every there is such that . The -bracketing number of with respect to measure , denoted by , is the size of the smallest -bracketing for with respect to .
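As a concrete illustration of Definition 2.1, the following minimal Python sketch builds a bracketing for one-dimensional thresholds under the uniform measure. The instance space [0, 1], the convention that a threshold at `a` labels `x` with 1 exactly when `x >= a`, and the helper names `threshold`, `threshold_brackets`, and `find_bracket` are all assumptions made for this example.

```python
import math


def threshold(a):
    """Hypothetical threshold classifier on [0, 1]: label 1 iff x >= a."""
    return lambda x: 1 if x >= a else 0


def threshold_brackets(eps):
    """Return brackets as (lower_cut, upper_cut) pairs of threshold positions.

    The lower function is the threshold at the larger cut (it outputs fewer
    ones) and the upper function is the threshold at the smaller cut. The two
    differ only on an interval of length at most eps, so each pair is an
    eps-bracket under the uniform measure on [0, 1].
    """
    k = math.ceil(1.0 / eps)
    grid = [j / k for j in range(k + 1)]
    return [(grid[j + 1], grid[j]) for j in range(k)]


def find_bracket(a, brackets):
    """Find a bracket whose lower and upper functions sandwich threshold(a)."""
    for lower_cut, upper_cut in brackets:
        if upper_cut <= a <= lower_cut:
            return lower_cut, upper_cut
    raise ValueError("threshold position must lie in [0, 1]")


if __name__ == "__main__":
    brackets = threshold_brackets(0.25)
    print(len(brackets), find_bracket(0.6, brackets))
```

In this sketch the number of brackets is about 1/eps, which matches the size of an ordinary cover for thresholds; the extra property is that the sandwich holds at every point, not just on average.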
3 Regret Bounds for Smoothed Adaptive and Non-Adaptive Adversaries
In this section, we obtain regret bounds against smoothed adversaries. For finite hypothesis classes $\mathcal{H}$, existing no-regret algorithms such as Hedge (HEDGE) and Follow-the-Perturbed-Leader (FTPL) achieve a regret bound of $O(\sqrt{T \ln |\mathcal{H}|})$. For a possibly infinite hypothesis class our approach uses a finite set $\mathcal{H}'$ as a proxy for $\mathcal{H}$ and only focuses on competing with hypotheses in $\mathcal{H}'$ by running a standard no-regret algorithm on $\mathcal{H}'$. Indeed, in the absence of smoothness of the adversary’s distributions, $\mathcal{H}'$ has to be a good proxy with respect to every distribution or know the adversarial sequence ahead of time, neither of which is possible in the online setting. But when distributions are smooth, an $\mathcal{H}'$ that is a good proxy for the uniform distribution can also be a good proxy for all other smooth distributions. We will see that how well a set $\mathcal{H}'$ approximates $\mathcal{H}$ depends on adaptivity (versus non-adaptivity) of the adversary. Our main technical result in Section 3.1 shows that for adaptive adversaries this approximation depends on the size of the $\epsilon$-bracketing cover of $\mathcal{H}$. This results in an algorithm whose regret is sublinear in $T$ and logarithmic in that bracketing number for adaptive adversaries (Theorem 3.3). In comparison, for simpler non-adaptive adversaries this approximation depends on the size of the more traditional $\epsilon$-covers of $\mathcal{H}$, which do not require pointwise approximation of $\mathcal{H}$. This leads to an algorithm against non-adaptive adversaries with an improved regret bound (Theorem 3.1).
In Section 3.2, we demonstrate that the bracketing numbers of commonly used hypothesis classes in machine learning are small functions of their VC dimension. We also provide systematic approaches for bounding the bracketing number of complex hypothesis classes in terms of the bracketing number of their simpler building blocks. This shows that for many commonly used hypothesis classes — such as halfspaces, polynomial threshold functions, and polytopes — we can achieve a regret of even against an adaptive adversary.
3.1 Regret Analysis and the Connection to Bracketing Number
In more detail, consider an algorithm that uses Hedge on a finite set instead of . Then,
where the first term is the regret against the best and the second term captures how well approximates . A natural choice of is an -cover of with respect to the uniform distribution, for a small that will be defined later. This bounds the first term using the fact that there is an -cover of size . To bound the second term, we need to understand whether there is a hypothesis whose value over an adaptive sequence of -smooth distributions can be drastically different from the value of its closest (under uniform distribution) proxy . Considering the symmetric difference functions for functions and their corresponding proxies , we need to bound (in expectation) the maximum value an can attain over an adaptive sequence of -smooth distributions.
To develop more insight, let us first consider the case of non-adaptive adversaries. In the case of non-adaptive adversaries, are independent of each other, while they are not identically distributed. This independence is the key property that allows us to use the VC dimension of the set of functions to establish a uniform convergence property where with high probability every function has a value that is close to its expectation — the fact that s are not identically distributed can be easily handled because the double sampling and symmetrization trick in VC theory can still be applied as before. Furthermore, -smoothness of the distributions implies that . This leads to the following theorem for non-adaptive adversaries.
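The smoothness step used above can be written out in one line. The following is a sketch under assumed notation: $p$ is the marginal density of a $\sigma$-smooth distribution $\mathcal{D}$, $u$ is the density of the uniform distribution $\mathcal{U}$, smoothness means $p(x) \le u(x)/\sigma$ pointwise, and $A$ is any measurable subset of the instance space.

```latex
\Pr_{x \sim \mathcal{D}}[x \in A]
  = \int_A p(x)\,dx
  \le \frac{1}{\sigma} \int_A u(x)\,dx
  = \frac{1}{\sigma}\,\Pr_{x \sim \mathcal{U}}[x \in A].
```

Taking $A$ to be the region where a hypothesis and its proxy disagree, a disagreement of at most $\epsilon$ under the uniform distribution becomes a disagreement of at most $\epsilon/\sigma$ under any $\sigma$-smooth distribution.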
Theorem 3.1 (Non-Adaptive Adversary (haghtalab2018foundation)).
Let be a hypothesis class of VC dimension . There is an algorithm such that for any that is a non-adaptive sequence of -smooth distributions has regret
Moving back to the case of adaptive adversaries, we unfortunately lose this uniform convergence property (see Appendix B for an example). This is due to the fact that now the choice of can depend on the earlier realization of instances . To see why independence is essential, note that the ubiquitous double sampling and symmetrization techniques used in VC theory require that taking two sets of samples and from the process that is generating data, we can swap and independently of whether and are swapped for . When the choice of depends on then swapping with affects whether and could even be generated from for . In other words, symmetrizing the first variables generates possible choices for that exponentially increases the set of samples over which a VC class has to be projected, therefore losing the typical regret bound and instead obtaining the trivial regret of . Nevertheless, we show that the earlier ideas for bounding the second term of Equation 1 are still relevant as long as we can side step the need for independence.
Note that -smoothness of the distributions still implies that for a fixed function even though is dependent on the realizations , we still have . Indeed, the value of any function for which can be bounded by the convergence property of an appropriately chosen Bernoulli variable. As we demonstrate in the following lemma, this allows us to bound the expected maximum value of a chosen from a finite set of symmetric differences. For a proof of this lemma refer to Appendix C.2.
Lemma 3.2.
Let be any finite class of functions such that for all , i.e., every function has measure over the uniform distribution. Let be any adaptive sequence of , -smooth distributions for some such that . We have that
The set of symmetric differences we work with is of course infinitely large. Therefore, to apply Lemma 3.2 we have to approximate with a finite set such that
What should this set be? Note that choosing that is an -cover of under the uniform distribution is an ineffective attempt plagued by the same lack of independence that we are trying to side step. In fact, while all functions are close to the constant functions with respect to the uniform distribution, they are activated on different parts of the domain. So it is not clear that an adaptive adversary, who can see the earlier realizations of instances, cannot ensure that one of these regions will receive a large number of realized instances. But a second look at Equation 2 suffices to see that this is precisely what we can obtain if were to be the set of (upper) functions in an -bracketing of . That is, for every function there is a function such that . This proves Equation 2 with an exact inequality using the fact that pointwise approximation implies that the value of is bounded by that of for any set of instances that could be generated by . Furthermore, functions in are within of the constant function over the uniform distribution, so meets the criteria of Lemma 3.2 with the property that for all , . It remains to bound the size of class in terms of the bracketing number of . This can be done by showing that the bracketing number of class , that is the class of all symmetric differences in , is approximately bounded by the same bracketing number of (see Theorem 3.7 for more details). Putting these all together we get the following regret bound against smoothed adaptive adversaries.
Theorem 3.3 (Adaptive Adversary).
Let be a hypothesis class over domain , whose -bracketing number with respect to the uniform distribution over is denoted by . There is an algorithm such that for any that is an adaptive sequence of -smooth distributions has regret
3.2 Hypothesis Classes with Small Bracketing Numbers.
In this section, we analyze bracketing numbers of some commonly used hypothesis classes in machine learning. We start by reviewing the bracketing number of halfspaces and provide two systematic approaches for extending this bound to other commonly used hypothesis classes. Our first approach bounds the bracketing number of any class using the dimension of the space needed to embed it as halfspaces. Our second approach shows that $k$-fold operations on any hypothesis class, such as taking the class of intersections or unions of all $k$ hypotheses in a class, only mildly increase the bracketing number. Combining these two techniques allows us to bound the bracketing number of commonly used classifiers such as halfspaces, polytopes, polynomial threshold functions, etc.
The connection between bracketing number and VC theory has been explored in recent works. adams2010; adams2012 showed that finite VC dimension classes also have finite -bracketing number but AlonWezlHaussler (see UGC for a modern presentation) showed the dependence on can be arbitrarily bad. Since Theorem 3.3 depends on the growth rate of bracketing numbers, we work with classes for which we can obtain -bracketing numbers with reasonable growth rate, those that are close to the size of standard -covers.
Theorem 3.4 (Bracket_Halfspaces).
Let be the class of halfspaces over . For any and any measure over ,
Our first technique uses this property of halfspaces to bound the bracketing number of any hypothesis class as a function of the dimension of the spaces needed to embed this class as halfspaces.
Definition 3.5 (Embeddable Classes).
Let be a hypothesis class on . We say that is embeddable as halfspaces in dimensions if there exists a map such that for any , there is a linear threshold function such .
Theorem 3.6 (Bracketing Number of Embeddable Classes).
Let be a hypothesis class embeddable as halfspaces in dimensions. Then, for any measure ,
Our second technique shows that combining classes, by respectively taking intersections or unions of any functions from them, only mildly increases their bracketing number.
Theorem 3.7 (Bracketing Number of -fold Operations).
Let be hypothesis classes. Let and be the class of all hypotheses that are intersections and unions of functions , respectively. Then,
For any hypothesis class and ,
We now use our techniques for bounding the bracketing number of complex classes by the bracketing number of their simpler building blocks to show that online learning with an adaptive adversary on a class of halfspaces, polytopes, and polynomial threshold functions has regret.
Consider instance space and let be an arbitrary measure on . Let be the class of -degree polynomial thresholds and be the class -polytopes in . Then,
for some constants and . Furthermore, there is an online algorithm whose regret against an adaptive -smoothed adversary on the class and is respectively and .
4 Differential Privacy
In this section, we consider smoothed analysis of differentially private learning in query answering and data release paradigms. We primarily focus on -smooth distributions and defer the general case of -smooth distributions to Appendix G. For finite query classes and small domains, existing differentially private mechanisms achieve an error bound that depends on and . We leverage smoothness of data sets to improve these dependencies to and .
An Existing Algorithm.
NIPS2012_4548 introduced a practical algorithm for data release, called Multiplicative Weights Exponential Mechanism (MWEM). This algorithm works for a finite query class over a finite domain . Given a data set and its corresponding empirical distribution , MWEM iteratively builds distributions for , starting from that is the uniform distribution over . At stage , the algorithm picks a that approximately maximizes the error using a differentially private mechanism (the Exponential mechanism). Then the data set is updated using the multiplicative weights update rule where is a differentially private estimate (via the Laplace mechanism) for the value. The output of the mechanism is a data set . The formal guarantees of the algorithm are as follows.
Theorem 4.1 (NIPS2012_4548).
For any data set of size , a finite query class , and , MWEM is -differentially private and with probability at least produces a distribution over such that
The analysis of MWEM keeps track of the KL divergence and shows that at time this value decreases by approximately the error of query . At a high level, . Moreover, KL divergence of any two distributions is non-negative. Therefore, error of any query after steps follows the above bound.
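The loop described above (private selection of a badly answered query, a noisy measurement, then a multiplicative weights update) can be sketched in a few lines of Python. This is a simplified illustration in the spirit of MWEM, not the mechanism as specified in NIPS2012_4548: the even split of the privacy budget across rounds, the update step size, the clamping of the noisy measurement, and the names `answer` and `mwem_sketch` are assumptions made for this example and should be checked against the original paper before any real use.

```python
import math
import random


def answer(query, dist):
    """Value of a query (a list of values in [0, 1], one per domain element)."""
    return sum(q * p for q, p in zip(query, dist))


def mwem_sketch(data_dist, n, queries, rounds, epsilon, rng):
    """Return a synthetic distribution over a finite domain.

    data_dist is the empirical distribution of a data set of size n.
    """
    size = len(data_dist)
    current = [1.0 / size] * size           # start from the uniform distribution
    average = [0.0] * size
    eps_round = epsilon / (2.0 * rounds)    # assumed even split of the budget
    for _ in range(rounds):
        # Exponential mechanism: prefer queries with large current error.
        scores = [n * abs(answer(q, data_dist) - answer(q, current))
                  for q in queries]
        top = max(scores)
        weights = [math.exp(eps_round * (s - top) / 2.0) for s in scores]
        chosen = rng.choices(range(len(queries)), weights=weights)[0]
        query = queries[chosen]
        # Laplace mechanism: the query value has sensitivity 1 / n.
        rate = eps_round * n
        noise = rng.expovariate(rate) - rng.expovariate(rate)
        measured = min(1.0, max(0.0, answer(query, data_dist) + noise))
        # Multiplicative weights update toward the noisy measurement.
        gap = measured - answer(query, current)
        current = [p * math.exp(qx * gap / 2.0) for p, qx in zip(current, query)]
        total = sum(current)
        current = [p / total for p in current]
        average = [a + p / rounds for a, p in zip(average, current)]
    return average
```

The sketch returns the average of the intermediate distributions; with a very large `epsilon` the noise vanishes and the loop reduces to plain multiplicative weights on the worst-answered query.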
To design a private query answering algorithm for a query class without direct dependence on and we leverage smoothness of distributions. Our algorithm called the Smooth Multiplicative Weight Exponential Mechanism (Smooth MWEM), given an infinite set of queries , considers a -cover under the uniform distribution. Then, it runs the MWEM algorithm with as the query set and constructs an empirical distribution . Finally, upon being requested an answer to a query , it responds with , where is the closest query to under the uniform distribution. This algorithm is presented in Appendix E. Note that does not depend on the data set . This is the key property that enables us to work with a finite -cover of and extend the privacy guarantees of MWEM to infinite query classes. In comparison, constructing a -cover of with respect to the empirical distribution uses private information.
Let us now analyze the error of our algorithm and outline the reasons it does not directly depend on and . Recall that from the -smoothness, there is a distribution that is -smooth and for all . Furthermore, can be taken to be a subset of and thus is -smooth with respect to . The approximation of by a -cover introduces error in addition to the error of Theorem 4.1. This error is given by . Note that , therefore, this removes the error dependence on the size of the query set while adding a small error of . Furthermore, Theorem 4.1 dependence on is due to the fact that for a worst-case (non-smooth) data set , can be as high as . For a -smooth data set, however, . This allows for faster error convergence. Applying these ideas together and setting gives us the following theorem whose proof is deferred to Appendix E.
Theorem 4.2.
For any -smooth dataset of size , a query class with VC dimension , and , Smooth Multiplicative Weights Exponential Mechanism is -differentially private and with probability at least , calculates values for all such that
Above we described a procedure for query answering that relied on the construction of a data set. One could ask whether this leads to a solution to the data release problem as well. An immediate, but ineffective, idea is to output distribution constructed by our algorithm in the previous section. The problem with this approach is that while for all queries in the cover , there can be queries for which is quite large. This is due to the fact that even though is -smooth (and is -smooth), the repeated application of multiplicative update rule may result in distribution that is far from being smooth.
To address this challenge, we introduce Projected Smooth Multiplicative Weight Exponential Mechanism (Projected Smooth MWEM) that ensures that is also -smooth by projecting it on the convex set of all -smooth distributions. More formally, let be the polytope of all -smooth distributions over and let be the outcome of the multiplicative update rule of NIPS2012_4548 at time . Then, Projected Smooth MWEM mechanism uses To ensure that these projections do not negate the progress made so far, measured by the decrease in KL divergence, we note that for any and any , we have That is, as measured by the decrease in KL divergence, the improvement with respect to can only be greater than that of . Optimizing parameters and , we obtain the following guarantees. See Appendix F for more details on Projected Smooth MWEM mechanism and its analysis.
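The projection step can be illustrated on a finite domain. The sketch below assumes that the smooth polytope is the set of distributions over N elements with every probability at most 1/(σN), and that the projection of a distribution q is the point p in that set minimizing KL(p || q). Under those assumptions the minimizer has the closed form p_i = min(cap, α q_i), which the code finds by repeatedly capping the overweight coordinates and rescaling the rest. The function names `kl` and `project_to_smooth` are hypothetical, and the entries of q are assumed to be strictly positive.

```python
import math


def kl(p, q):
    """KL divergence KL(p || q) between two finite distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def project_to_smooth(q, sigma):
    """KL-project q onto {p : sum(p) = 1, 0 <= p_i <= 1 / (sigma * N)}.

    Assumes 0 < sigma <= 1 and all entries of q are positive.
    """
    n = len(q)
    cap = 1.0 / (sigma * n)
    capped = set()
    alpha = 1.0
    while True:
        free = [i for i in range(n) if i not in capped]
        if not free:
            break
        # Rescale the uncapped coordinates so the total mass is exactly 1.
        alpha = (1.0 - cap * len(capped)) / sum(q[i] for i in free)
        over = [i for i in free if alpha * q[i] > cap]
        if not over:
            break
        capped.update(over)
    return [cap if i in capped else alpha * q[i] for i in range(n)]


if __name__ == "__main__":
    q = [0.7, 0.1, 0.1, 0.1]
    print(project_to_smooth(q, sigma=0.5))
```

For a smooth target distribution, the divergence from the target to the projected point is no larger than the divergence from the target to the unprojected point, which is the property the text relies on to argue that projection does not undo progress.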
Theorem 4.3 (Smooth Data Release).
Let be a -smooth database with data points. For any and any query set with VC dimension , Projected Smooth Multiplicative Weight Exponential Mechanism is differentially private and with probability at least its outcome satisfies
5 Conclusions and Open Problems
Our work introduces a framework for smoothed analysis of online and private learning and obtains regret and error bounds that depend only on the VC dimension and the bracketing number of a hypothesis class and are independent of the domain size and Littlestone dimension.
Our work leads to several interesting questions for future work. The first is to characterize learnability in the smoothed setting — via matching lower bounds — in terms of a combinatorial quantity, e.g., bracketing number. In Appendix D, we discuss sign rank and its connection to bracketing number as a promising candidate for this characterization. A related question is whether there are finite VC dimension classes that cannot be learned in presence of smoothed adaptive adversaries.
Let us end this paper by noting that the Littlestone dimension plays a key role in characterizing learnability and algorithm design in the worst-case for several socially and practically important constraints (ben2009agnostic; AlonWezlHaussler). It is essential then to develop models that can bypass Littlestone impossibility results and provide rigorous guidance in achieving these constraints in practical settings.
This work was partially supported by the NSF under CCF-1813188, the ARO under W911NF1910294 and a JP Morgan Chase Faculty Fellowship.
Appendix A Additional Related Work
Analogous models of smoothed online learning have been explored in prior work. NIPS2011_4262 consider online learning when the adversary is constrained in several ways and work with a notion of sequential Rademacher complexity for analyzing the regret. In particular, they study a related notion of smoothed adversary and show that one can learn thresholds with regret of in presence of smoothed adversaries. Gupta_Roughgarden consider smoothed online learning in the context of online algorithm design. They show that while optimizing parameterized greedy heuristics for Maximum Weight Independent Set imposes linear regret in the worst case, in presence of smoothing this problem can be learned with sublinear regret (as long as they allow per-step runtime that grows with $T$). Cohen-Addad consider the same problem with an emphasis on the per-step runtime being logarithmic in $T$. They show that piecewise constant functions over the interval can be learned efficiently within regret of against a non-adaptive smooth adversary. Our work differs from these by upper bounding the regret using a combinatorial dimension of the hypothesis class and demonstrating techniques that generalize to a large class of problems in presence of adaptive adversaries.
In another related work, Dispersion introduce a notion of dispersion in online optimization (where the learner picks an instance and the adversary picks a function) that is a constraint on the number of discontinuities in the adversarial sequence of functions. They show that online optimization can be done efficiently under certain assumptions. Moreover, they show that sequences generated by non-adaptive smooth adversaries in one dimension satisfy dispersion. In comparison, our main results in online learning consider the more powerful adaptive adversaries.
Smoothed analysis is also used in a number of other online settings. In the setting of linear contextual bandits, kannan2018smoothed use smoothed analysis to show that the greedy algorithm achieves sublinear regret even though in the worst case it can have linear regret. raghavan2018externalities work in a Bayesian version of the same setting and achieve improved regret bounds for the greedy algorithm. Since several algorithms are known to have sublinear regret in the linear contextual bandit setting even in the worst-case, the main contribution of these papers is to show that the simple and practical greedy algorithm has much better regret guarantees than in the worst-case. In comparison, we work with a setting where no algorithm can achieve sublinear regret in the worst-case.
Smoothed analysis has also been considered in the context of differential privacy. DPMultWeights consider differential privacy in the interactive setting, where the queries arrive online. They analyze a multiplicative weights based algorithm whose running time and error they show can be vastly improved in the presence of smoothness. Some of our techniques for query answering and data release are inspired by that line of work. Dispersion also consider differential privacy in presence of dispersion and analyze the guarantees of the exponential mechanism.
Generally, our work is also related to a line of work on online learning in the presence of additional assumptions resembling properties exhibited by real-life data. PredictableSequences consider settings where additional information in terms of an estimator for future instances is available to the learner. They achieve regret bounds that are in terms of the path length of these estimators and can beat if the estimators are accurate. dekel2017online also consider the importance of incorporating side information in the online learning framework and show that regrets of
in online linear optimization may be possible when the learner knows a vector that is weakly correlated with the future instances.
More broadly, our work is part of a growing line of work on beyond-worst-case analysis of algorithms [roughgarden_2020], which considers the design and analysis of algorithms on instances that satisfy properties demonstrated by real-world applications. Examples of this in theoretical machine learning mostly include improved runtime and approximation guarantees in numerous supervised (e.g., [LearningSmoothed, kalai2008decision, onebit, Masart]) and unsupervised (e.g., [bilu_linial_2012, kcenter, stable_clustering, TopicModelling, Decoupling, VDW, MMVMaxCut, llyods, HardtRoth]) settings.
Appendix B Lack of Uniform Convergence with Adaptive Adversaries
The following example for showing lack of uniform convergence over adaptive sequences is due to haghtalab2018foundation and is included here for completeness.
Let and be the set of one-dimensional thresholds. Let the distribution of the noise be the uniform distribution on . Let and if while otherwise. In this case, we do not achieve concentration for any value of , as
Appendix C Proofs from Section 3
C.1 Algorithm and its Running Time
While our main focus is to provide sublinear regret bounds for smoothed online learning, our analysis also provides an algorithmic solution, described below.
The running time of the algorithm consists of the initial construction of and then running a standard online learning algorithm on .
Standard online learning algorithms such as Hedge and FTPL take time polynomial in the size of the cover since in standard implementations they maintain a state corresponding to each hypothesis in . In our setting, the size of the cover is .
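To illustrate why the per-round cost scales with the size of the cover, the following is a minimal sketch of Hedge run over a finite list of hypotheses. It is not the paper's implementation: the function name `hedge`, the grid of thresholds standing in for a cover, the learning rate, and the synthetic data stream are all assumptions made for illustration.

```python
import math
import random


def hedge(experts, stream, eta, rng):
    """Run Hedge over a finite list of experts (illustrative sketch).

    experts: list of callables x -> {0, 1}
    stream:  iterable of (x, y) pairs revealed one round at a time
    Returns (learner_mistakes, per_expert_mistakes).
    """
    log_w = [0.0] * len(experts)          # one state entry per hypothesis
    learner_mistakes = 0
    expert_mistakes = [0] * len(experts)
    for x, y in stream:
        # Sample a hypothesis with probability proportional to its weight.
        m = max(log_w)
        weights = [math.exp(lw - m) for lw in log_w]
        idx = rng.choices(range(len(experts)), weights=weights, k=1)[0]
        learner_mistakes += int(experts[idx](x) != y)
        # Update every hypothesis: this loop is linear in the cover size.
        for i, h in enumerate(experts):
            loss = int(h(x) != y)
            expert_mistakes[i] += loss
            log_w[i] -= eta * loss
    return learner_mistakes, expert_mistakes


# Hypothetical finite cover: thresholds h_t(x) = 1[x >= t] on a grid of 11 points.
thresholds = [i / 10 for i in range(11)]
experts = [(lambda x, t=t: int(x >= t)) for t in thresholds]

# Synthetic stream (an assumption): uniform instances labeled by threshold 0.5.
T = 500
data_rng = random.Random(1)
stream = [(x, int(x >= 0.5)) for x in (data_rng.random() for _ in range(T))]

eta = math.sqrt(8 * math.log(len(experts)) / T)
learner_mistakes, expert_mistakes = hedge(experts, stream, eta, random.Random(0))
```

The inner update loop touches every hypothesis once per round, which is the source of the per-round cost that is linear in the size of the cover in this standard implementation.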
The time required to construct a cover depends on the access we have to the class. One method is to randomly sample a set with points from the domain uniformly and construct all possible labelings on this set induced by the class. The number of labelings of is bounded by by the Sauer–Shelah lemma. The cover is then constructed by finding functions in the class that are consistent with each of these labelings. This requires us to be able to find an element in the class consistent with a given labeling, which can be done by a “consistency” oracle. Naively, the above makes calls to the consistency oracle, one for each possible labeling of .
The above analysis and runtime can be improved in several ways. First, can be constructed in time rather than . This can be done by constructing the cover in a hierarchical fashion, where the root includes the unlabeled set and at every level one additional instance in is labeled by or . At each node, the consistency oracle will return a function that is consistent with the labels so far or state that none exists. Nodes for which no consistent hypothesis so far exists are pruned and will not expand in the next level. Since the total number of leaves is the number of ways in which can be labeled by , i.e., , the number of calls to the consistency oracle is as well. The runtime of standard online learning algorithms can also be improved significantly when an empirical risk minimization oracle is available to the learner, in which case a runtime of for general classes [HazanKoren] or even for structured classes [Oracle_Efficient] is possible.
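The hierarchical construction can be sketched as follows for one-dimensional thresholds. This is an illustrative sketch under stated assumptions: `threshold_oracle` is a hypothetical consistency oracle written for this example, `build_cover` is a hypothetical name, and the sample of 8 uniform points is arbitrary.

```python
import random


def threshold_oracle(labeled):
    """Hypothetical consistency oracle for thresholds h_t(x) = 1[x >= t].

    labeled: list of (x, y) pairs with y in {0, 1}.
    Returns a threshold t consistent with every pair, or None if none exists.
    """
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    lo = max(neg) if neg else float("-inf")   # need t > lo
    hi = min(pos) if pos else float("inf")    # need t <= hi
    if lo >= hi:
        return None
    if pos:
        return hi
    return lo + 1.0 if neg else 0.0


def build_cover(points, oracle):
    """Label one more point per level and prune inconsistent partial labelings."""
    frontier = [([], oracle([]))]             # root: nothing labeled yet
    calls = 1
    for x in points:
        next_frontier = []
        for labeled, _ in frontier:
            for y in (0, 1):
                candidate = labeled + [(x, y)]
                h = oracle(candidate)
                calls += 1
                if h is not None:             # pruned nodes are never expanded
                    next_frontier.append((candidate, h))
        frontier = next_frontier
    return [h for _, h in frontier], calls


rng = random.Random(0)
points = [rng.random() for _ in range(8)]
cover, calls = build_cover(points, threshold_oracle)
labelings = {tuple(int(x >= t) for x in points) for t in cover}
```

In this sketch the oracle is queried at every node of the pruned tree rather than only at the leaves, so the call count is larger than the number of leaves by a factor of at most roughly twice the number of sampled points. For thresholds on 8 distinct points it still stays far below the 2^8 calls of the naive enumeration.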
C.2 Proof of Lemma 3.2
At a high level, note that any has measure at most on any (even adaptively chosen) -smooth distribution. Therefore, for any fixed , . To achieve this bound over all , we take a union bound over all such functions.
More formally, for any
Consider a fixed . Note that even when the choice of a -smoothed distribution depends on earlier realizations of , . Therefore, for
is stochastically dominated by that of a binomial distribution. Note that is a monotonically increasing function and let . We have
Let . Note that because , we have . Hence, by replacing in the above inequality we have
C.3 Proof of Theorem 3.3
Consider any hypothesis class and an algorithm that is no-regret with respect to any adaptive adversary on hypotheses in . It is not hard to see that
Therefore, it is sufficient to choose an of moderate size such that every function has a proxy even when these functions are evaluated on instances drawn from a non-iid and adaptive sequence of smooth distributions. We next describe the choice of .
Let be a -net of with respect to the uniform distribution , for an that we will determine later. Note that any -bracket with respect to is also an -net, so . (Alternatively, we can bound by HAUSSLER1995217.) Let be the set of symmetric differences between and its closest proxy , that is,
Note that because is a subset of all the symmetric differences of two functions in , by Theorem 3.7 its bracketing number is bounded as follows.
Let be the set of upper -brackets of with respect to , i.e., for all , there is such that for all , and . Note that