Adaptive Sampling for Convex Regression

08/14/2018 ∙ by Max Simchowitz, et al. ∙ berkeley college University of Washington 0

In this paper, we introduce the first principled adaptive-sampling procedure for learning a convex function in the L_∞ norm, a problem that arises often in economics, psychology, and the social sciences. We present a function-specific measure of complexity and use it to prove that our algorithm is information-theoretically near-optimal in a strong, function-specific sense. We also corroborate our theoretical contributions with extensive numerical experiments, finding that our method substantially outperforms passive, uniform sampling for favorable synthetic and data-derived functions in low-noise settings with large sampling budgets. Our results also suggest an idealized `oracle strategy', which we use to gauge the potential for deploying the adaptive-sampling strategy on any function in any particular setting.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Many functions that model individual economic utility, the output of manufacturing processes, and natural phenomena in the social sciences are either convex or concave. For example, convex functions are used to model utility functions that exhibit temporal discounting, a classic effect in behavioral economics where people value immediate rewards over delayed rewards (Frederick et al., 2002; Green and Myerson, 2004). To measure such curves, it is common practice to manipulate a variable (e.g., delay) over a fixed, uniformly spaced grid of design points 5 points (Fisher, 1937)

, collect many repeated trials of data, and fit a function of assumed parametric form (e.g., exponential or hyperbolic) using maximum likelihood estimation. This approach can be brittle to model mismatch when the true function

lies outside the assumed class of functions. Moreover, non-linear parametric families can introduce challenges for constructing faithful and accurate confidence intervals when interpolating the estimator between measured design points.

Non-parametric convex regression (c.f. Dümbgen et al. (2004)) corrects for the shortcomings of parametric methods by making no assumptions other than that is convex. In additional to faithfully modeling a large class of functions, non-parametric methods can also be employed to construct error bars at any (see Cai et al. (2013)). Unfortunately, even with shape restrictions, non-parametric methods may require prohibitively many samples for practical use.

In this paper, we propose a more parsimonious approach to non-parametric curve estimation by allowing the design points to be chosen sequentially and adaptively. Formally, we consider the problem of estimating an unknown convex function with an estimator which is close in the metric  . The estimator is constructed from sequential, noisy evaluations from an oracle at design points ,

(1)

and where represents zero-mean noise. We let denote the filtration generated by the design points and measurements up to time , and assume that the number of samples is a stopping time with respect to , where is zero mean, -subgaussian. We refer to measurement allocation strategies for which does not depend on as passive, and adaptive sampling strategies for which may depend on as active.

Our main contributions are the following:

  • Inspired by Cai et al. (2013), we introduce the local approximation modulus, a new measure of local curvature for convex functions, , and a function-specific complexity measure , called the average approximation modulus. coincides with the average curvature of , up to logarithmic factors and endpoint considerations. We prove a function-specific lower bound on the sample complexity of actively estimating any convex function to -error that scales at least as , up to logarithmic factors.

  • The packing argument for constructing our lower bound explicitly describes a near-optimal, clairvoyant sampling allocation tailored to ; we call this the “oracle allocation” (Proposition 3.4). This allocation is instructive as an experimental benchmark when is known.

  • We introduce an active sampling procedure and an estimator whose sample complexity for any particular scales as up to logarithmic factors, nearly matching our lower bounds.

  • We show that for passive designs (e.g., sampled evenly on a grid), the sample complexity necessarily scales as , where coincides with the maximum curvature. We compare and for many natural classes of functions, including quadratic functions, exponential curves, and -piecewise linear functions. For -piecewise linear functions, the error of our active algorithm scales no slower than , whereas passive designs scale no faster than after evaluations (see Remark 3.2).

Finally, we validate our theoretical claims with an empirical study using both synthetic functions and those derived from real data. We observe that in low-noise settings or when the sampling budget is large, active sampling can substantially outperform passive uniform sampling. Moreover, our algorithm constitutes the first theoretically justified algorithm (passive or active) that guarantees uniform accuracy, even at the boundaries of the interval (Cai et al., 2013; Dümbgen et al., 2004). Even so, comparing the performance of our active algorithm to the oracle sampling strategy suggests room for modest but non-negligible improvements.

1.1 Related Work

Castro et al. (2005) and Korostelev (1999)

studied the minimax rates of active non-parametric regression, showing that active and passive learning attain the same minimax rates of convergence for Holder smooth classes, but that active learning achieves faster rates when the function is known to be well approximated by a piecewise-constant function.

Prior literature on convex and concave regression consider the passive design case, where the design points do not depend on measurements. Typically, the design points are chosen to be uniformly spaced on the unit interval, that is, for (Dümbgen et al., 2004). If is the set of Lipschitz, convex functions, then the -norm of the least squares estimator decreases like , whereas if the convex function has Lipschitz gradients, the rate improves to (Dümbgen et al., 2004).

Recent work by Guntuboyina and Sen (2015) and Chatterjee (2016) has aimed at developing sharp errors bounds on the squared -norm of the least squares estimator, when samples are uniformly spaced on a grid. They show that even with this uniform allocation, the error adapts to the true regression functions . For example, Chatterjee (2016) and Bellec et al. (2018) show that if is a -piecewise linear function, then obtains the parametric error rate of . In a similar vein, Cai et al. (2013) proves sharp confidence intervals for for a fixed point .

Our work draws heavily upon Cai et al. (2013) (who in turn build on Dümbgen et al. (2003)), whose aim was to characterize the function-specific sample complexity of estimating a convex at a given point in the interior of , from uniform measurements. We extend these tools to characterize the complexity of estimating with uniform accuracy over the interval , from measurements which may be chosen in an adaptive, function-dependent manner. We are thus able to obtain exceptionally granular, instance-specific results similar to those in the multi-arm bandit literature (Kaufmann et al., 2016), and in recent work studying the local minimax sample complexity of convex optimization (Zhu et al., 2016).

2 Efficiently Learning a Convex Function

We begin by establishing preliminary notation. The class of convex functions is denoted as . For an interval , define the left-, middle- and right-endpoints as . We define the secant approximation of on an interval as

(2)

and note that for a convex function, this approximation never underestimates ; that is, one has . We denote the error of the second approximation to on at the midpoint as

In addition, we overload notation so that for any such that , we have .

We now state a remarkable fact about convex functions that is at the core of our analysis.

Lemma 2.1

For convex , .

Lemma 2.1 is a special case of a more general lemma stated in Section 6 that upper bounds the supremum of the secant approximation error by a constant using only a single point within the interval. Convexity is critical to the proof of this lemma and such a property does not hold, for instance, on merely monotonic functions. We remark that the first inequality is trivial, and the second inequality is tight in the sense that it is achieved by on interval as .

The above observations motivate our strategy of approximating with secant approximations on disjoint intervals whose union is . The next definition relates the secant approximation error to the required sampling density.

Definition 1 (Local Approximation Modulus)

We define the -approximation modulus of at a point as the least such that the midpoint secant approximation to on has bias :

(3)

Note that for all , because convex functions are continuous on their domain. Intuitively, is the scale at which “looks” linear around some , up to a tolerance . Away from the endpoints , smaller values of correspond to larger complexities, because they imply that can only be approximated by a linear function on a small interval. But if is the near , will take small values, potentially overestimating the local complexity. We remedy this issue by defining the following left- and right-approximation points:

Within , we will show that the midpoint errors concisely describe how densely one would need to sample in the neighborhood of in order to estimate it to the desired accuracy in the -norm. Moreover, we show that it suffices to sample at constant number of design points on the end-intervals and . At a high level, the main finding of this paper is as follows:

The sample complexity of learning a particular convex function up to accuracy in with passive sampling is parametrized by the worst-case approximation modulus

(4)

In contrast, the sample complexity of active sampling algorithms is parametrized by the average approximation modulus

(5)

We emphasize that the algorithm presented in in this work guarantees accuracy on the whole interval , whereas many passive algorithms pointwise and risk bounds (Cai et al., 2013; Dümbgen et al., 2004) only guarantee accuracy on a strictly smaller sub-interval.

2.1 Examples

Explicit parameterizations of provide intuition for when active sampling is advantageous. In this section, we describe different scalings of and for various ; later, in Remark 3.2, we explain how these scalings can imply substantial differences in sample complexity.

2.1.1 Piecewise Linear Functions

Let be a Lipschitz, piecewise linear convex function with a constant number of pieces. It follows that where is the distance to the closest knot adjoining any two linear pieces of . It follows that whereas .

2.1.2 Bounded third-derivative:

Suppose . We may apply a Taylor series to find as , which makes an explicit connection between the curvature of the function and the differences between and . This suggests that if the function has areas of high but localized curvature such as or then the difference between and can be as vast as versus .

2.1.3 Quadratic Functions:

Let for some real coefficients . Ignoring the effect of endpoints, for all due to the function having constant curvature, so .

3 Main Results

In this section, we state a formal upper bound obtained by Algorithm 2, described in Section 4. Algorithm 2 takes in a confidence parameter , as well as a second parameter governing the degree to which the active sampling algorithm is ‘aggressive’; from simulations, we recommend setting . Lastly, at each round, Algorithm 2 maintains an estimator , whose performance is characterized by the following theorem:

Theorem 3.1

Let be a universal constant, and for , and , define , and

Then, if Algorithm 2 is run with parameters and , with access to an oracle (1), the estimators and confidence estimates defined in Section 4.2 satisfy the following any-time guarantee:

In the case of the default parameter setting , we find that, for a possibly larger universal constant ,

Up to constants and logarithmic factors, the sample complexity is dominated by the term . Here corresponds to the standard rate for estimating a scalar. The dependence on captures the number of points required to estimate with a discretized proxy.

To better understand why is the appropriate quantity to consider, we now introduce a construction of local packings of , centered at a given . Recall that denotes the error of the secant approximation to on the midpoint of , constructed using the endpoints . We note that if any algorithm, even an active one, does not measure on an interval for which , then one cannot distinguish between and the alternative function . Thus, a key step to showing that approximately lower bounds the number of evaluations is to show that it approximately lower bounds the number of intervals for which . This is achieved in the following theorem proved in Section 5.3:

Theorem 3.2 (Packing)

Let be a convex function, , and define

(6)

where and is defined analogously. Then, there is an such that the points such that the intervals

have disjoint interiors, are contained in and satisfy . Moreover, the interval endpoints overlap so that .

Note that corresponds to the actual size of the explict packing, and is a computable lower bound on . We now consider the class of alternative functions

(7)

We observe that , and by definition, if are distinct, then . In particular, given any set of points for , then there exist two convex functions in , such that for all and . In Section 5.2, we formalize this argument to yield the following theorem:

Theorem 3.3

Fix an , , and . Let be as in Lemma 3.2, and let and be as given by Equation (7). Let be any active algorithm that returns an estimator at a stopping time , and satisfies the correctness guarantee

(8)

Then the stopping time , under observations from , is lower bounded by

and the average sample complexity over is at least

(9)

The above bounds hold when is replaced by .

Remark 3.1

The additional logarithmic factor that arises in (9) is due to the fact that estimating a function to -error corresponds to correctly performing simultaneous hypothesis tests, regarding the value of on each of the intervals . However, for any fixed (and, in particular, ), one can devise an algorithm that does not suffer this logarithmic factor by ‘biasing’ the algorithm towards that function.

In addition to providing a lower bound, the packing of Theorem 3.2 defines a near-optimal covering as well, in the sense that it defines a sampling allocation that can be used to test the hypothesis that, for a given , versus . Formally, we have the following:

Proposition 3.4

For every function , and , there exists a

a deterministic sampling allocation , and a test function constructed from the allocation such that

The design is explicitly constructed in Section 5.3 by augmenting the -intervals in Theorem 3.2 with at most three additional intervals to ensure coverage of all of . Crucially, we made use of the fact that intervals share endpoints, and have secant error exactly equal to . In light of Theorem 3.3, we see that the design is optimal for verifying that , up to scaling by constant factors. For this reason, we refer to this construction as the oracle allocation since it precisely characterizes the optimal sampling allocation taken if one knew . In general, this allocation may be too optimistic, since an algorithm which does not know the true cannot choose this allocation a fortiori.

3.1 Comparison between Upper and Lower Bounds

For the purpose of comparing upper and lower bounds, we will consider running Algorithm 2 with the setting ; any constant bounded away from zero will yield qualitatively similar results. We find that the upper bound of Theorem 3.1 and lower bound of Theorem 3.3 nearly match, with the following exceptions:

  1. The upper bound involves a doubly logarithmic factor that depends on . This is a consequence of the law of the iterated logarithm, which Algorithm 2 uses to maintain uniform correctness of its confidence intervals over time.

  2. Theorem 3.1 is given in terms of , whereas our lower bound is stated in terms of . The two quantities can be related by the following proposition, proved in Section 6.4.

    Proposition 3.5

    For any , and any convex , for all . Moreover,

    Hence, ignoring the contributions of the endpoints and , rescaling by a multiplicative constant changes by at most .

  3. Lastly, the upper and lower bounds differs in that requires dividing through by . We conjecture that the lower bound more accurately reflects the true sample complexity; see Remark A.1.

3.2 Sample Complexity for Passive Designs

In this section, we show that the sample complexity for estimating a convex function with an approximately uniform passive design up to error is governed by the parameter .

Theorem 3.6

Consider a (possibly randomized) passive design , which is uniform in the sense that, for some , and any interval with , one has that . Then, for a universal constant , any and all such that , there exists an alternative such that

The proof for the above theorem is as follows. Let

which intuitively corresponds to the point with the highest local curvature. Further, let , so that . If we consider the alternative function

(10)

then by construction, and differ only on and . So if can estimate up to -norm error , then can distinguish between and . Consequently, standard information-theoretic arguments (Section 5.2) imply that any sampling algorithm must collect samples within . Theorem 3.6 then follows by the uniformity of the sampling procedure. In the case where the design is passive but not uniform, it is possible that the design performs well on particular functions . In Remark A.2, we show that nevertheless, if the design is not uniform, it will underperform on a ‘translation’ of .

Remark 3.2

(Piecewise linear) If is Lipschitz and piecewise linear with a constant number of pieces, then from Section 2.1 we have whereas . Theorem 3.6 implies that any -correct passive sampling procedure requires measurements whereas Theorem 3.1 says that our active sampling procedure takes just . Thus, after total samples the of passive sampling decays no faster than whereas active sampling decays like .

4 Recursive Secant Approximation

We now introduce the recursive secant approximation algorithm for learning a convex function with noise. We begin by sampling each endpoint once. Subsequently, let denote the number of samples taken, and let denote a binary tree of intervals contained in , where the children of an interval are given by and . We let denote the set of leaves of . By construction, immediately satisfies the following properties stipulated in Lemma 4.1:

Lemma 4.1

For any , we have for any ; ; and .

At each round , we maintain three estimates of . First, an estimator of defined only at the points . Second, a secant-approximation estimator which extends the domain of to all of via:

(11)

Note that is well defined, since by Lemma 4.1, for all , (a) there exists an such that and (b) if for , then is a common endpoint of and , and thus the secant approximations coincide at so that . Lastly, since is not guaranteed to be convex when measurements are noisy, we define an estimator via an projection onto ,

(12)

By definition so that by the triangle inequality. When not clear from context, we employ the use of a subscript on to denote these functions once samples have been taken.

4.1 Recursive Secant Approximation without Noise

1 Initialize Tree of intervals , estimate for
2 For ,
3 For samples
4 (break ties arbitrarily)
5 If , observe ,
Else Insert and into as children of ; observe .
Algorithm 1 Noiseless Recursive Secant Approximation

To build intuition for Algorithm 2, we consider the following noiseless variant of our main algorithm, Algorithm 1, where the oracle returns noiseless queries . In this case, is set to be equal to at each point that is queried, and a placeholder value of elsewhere. The algorithm maintains the invariant that, for all , and have been measured and recorded in . Moreover, since the queries are noiseless, the secant approximation is convex and no projection is required.

At each round, Algorithm 1 queries the interval for which the secant bias is largest; note that if there is an interval for which has not been sampled, then and , and will be queried, with ties broken arbitrarily. In preparation for the analysis of the noise-tolerant algorithm, we shall analyze the stopping time:

(13)

We shall prove the following proposition:

Proposition 4.2

For all , . Moreover, for any we have

Proof  Since for all queried points , we have that, for , . Moreover, since for any , is constructed using secant approximations on a refinement of the intervals , (see Lemma 6.1.)

It remains to bound . Let denote the set of points sampled at the start of round ; in the noiseless setting, , but bounding will be of broader interest for the noise-tolerant algorithm. Since are the endpoints of the intervals , which are adjacent, we have . Moreover, if denotes the parent-intervals of , we have . Thus, to bound