Traditional machine learning theory quantifies how well a decision rule, learned from a limited data sample, generalizes to the entire population. The decision rule itself may enable the learner to correctly classify (as in image recognition) or predict the value of continuous statistics (as in regression) of previously unseen data samples. A standard mathematical formulation of this problem associates to each decision ruleand each sample , a loss , which may for example penalize misclassification of the data point by the decision rule. Then the learner seeks to minimize the regularized population risk:
Here, is an auxiliary function defined on that may encode geometric constraints or promote low-complexity structure (e.g., sparsity or low-rank) on . The main assumption is that the only access to the population data is by drawing i.i.d. samples from . Numerical methods then seek to obtain a high-quality solution estimate for (1.1) using as few samples as possible. Algorithmic strategies for (1.1) break down along two lines: streaming strategies and regularized empirical risk minimization (ERM).
Streaming algorithms in each iteration update a solution estimate of (1.1) based on drawing a relatively small batch of samples. Streaming algorithms deviate from each other in precisely how the sample is used in the update step. The proximal stochastic subgradient method [28, 46, 16] is one popular streaming algorithm, although there are many others, such as the stochastic proximal point and Gauss-Newton methods [15, 24, 72]. In contrast, ERM-based algorithms draw a large sample at the onset and output the solution of the deterministic problem
Solution methodologies for (1.2
) depend on the structure of the loss function. One generic approach, often used in practice, is to apply a streaming algorithm directly to (1.2) by interpreting as an expectation over the discrete distribution on the samples and performing multiple passes through the sampled data. Our current work focuses on the ERM strategy, though it is strongly influenced by recent progress on streaming algorithms.
The success of the ERM approach rests on knowing that the minimizer of the surrogate problem (1.2) is nearly optimal for the true learning task (1.1). Quantitative estimates of this type are often based on a uniform convergence principle. For example, when the functions are -Lipschitz continuous for a.e.
, then with probability, the estimate holds [62, Theorem 5]:
An important use of the bound in (1.3) is to provide a threshold beyond which algorithms for the surrogate problem (1.2) should terminate, since further accuracy on the ERM may fail to improve the accuracy on the true learning task. It is natural to ask if under stronger assumptions, learning is possible with sample complexity that is independent of the ambient dimension . In the landmark paper , the authors showed that the answer is indeed yes when the functions are convex and one incorporates further strongly convex regularization. Namely, under an appropriate choice of the parameter , the solution of the quadratically regularized problem
with probability , where is the diameter of the domain of . In contrast to previous work, the proof of this estimate is not based on uniform convergence. Indeed, uniform convergence in function values may fail in infinite dimensions even for convex learning problems. Instead, the property underlying the dimension independent bound (1.5) is that the solution of the quadratically regularized ERM (1.4) is stable in the sense of Bousquet-Elisseff . That is, the solution does not vary much under an arbitrary perturbation of a single sample . Stability of quadratically regularized ERM will also play a central role in our work for reasons that will become clear shortly.
The aforementioned bounds on the accuracy of regularized ERM are only meaningful if one can globally solve the deterministic problems (1.2) or (1.4). Convexity certainly facilitates global optimization techniques. Many problems of contemporary interest, however, are nonconvex, thereby making ERM-based learning rules intractable. When the functions are not convex but smooth, the most one can hope for is to find points that are critical for the problem (1.2). Consequently, it may be more informative to estimate the deviation in the gradients, , along with deviations in higher-order derivatives when they exist. Indeed, then in the simplest setting , the standard decomposition
When the loss is neither smooth nor convex, the situation becomes less clear. Indeed, one should reassess what “uniform convergence of gradients” should mean in light of obtaining termination criteria for algorithms on the regularized ERM problem. As the starting point, one may replace the gradient by a generalized subdifferential in the sense of nonsmooth and variational analysis [58, 44]. Then the minimal norms, and , could serve as stationarity measures akin to the norm of the gradient in smooth minimization. One may then posit that the stationarity measures, and , are uniformly close with high probability when the sample size is large. Pointwise convergence is indeed known to hold (e.g. [66, Theorem 7.54]). On the other hand, to our best knowledge, all results on uniform convergence of the stationarity measure are asymptotic and require extra assumptions, such as polyhedrality for example . The main obstacle is that the function is highly discontinuous. We refer the reader to [66, pp. 380] for a discussion. Indeed, the need to look beyond pointwise uniform convergence is well-documented in optimization and variational analysis [4, 2]. One remedy is to instead focus on graphical convergence concepts. Namely, one could posit that the Hausdorff distance between the subdifferential graphs, and , tends to zero. Here, we take a closely related approach, while aiming for finite-sample bounds.
In this work, we aim to provide tight threshold estimates beyond which algorithms on (1.2) should terminate. In contrast to previous literature, however, we will allow the loss function to be both nonconvex and nonsmooth. The only serious assumption we make is that is a -weakly convex function for a.e. , by which we mean that the assignment is convex. The class of weakly convex functions is broad and its importance in optimization is well documented [57, 50, 49, 59, 1].111Weak convexity also goes by other names such as lower-, uniformly prox-regular, and semiconvex. It trivially includes all convex functions and all -smooth functions with Lipschitz gradient. More broadly, it includes all compositions , where is convex and Lipschitz, and is
-smooth with Lipschitz Jacobian. Robust principal component analysis, phase retrieval, blind deconvolution, sparse dictionary learning, and minimization of risk measures naturally lead to stochastic weakly convex minimization problems. We refer the interested reader to[15, Section 2.1] and  for detailed examples.
The approach we take is based on a smoothing technique, familiar to optimization specialists. For any function , define the Moreau envelope and the proximal map:
It is well-known that if is -weakly convex and , then the envelope is -smooth with gradient
Note that is in principle computable by solving a convex optimization problem in the definition of the proximal point .
Our main result (Theorem 4.4) shows that with probability , the estimate holds:
The guarantee (1.6) is appealing: even though the subgradients of and may be far apart pointwise, the gradients of the smooth approximations and are uniformly close at a controlled rate governed by the sample size. Moreover, (1.6) directly implies estimates on the Hausdorff distance between subdifferential graphs, and , as we alluded to above. Indeed, the subdifferential graph is related to the graph of the proximal map by a linear isomorphism. The guarantee (1.6) is also perfectly in line with the recent progress on streaming algorithms [18, 15, 17, 78]. These works showed that a variety of popular streaming algorithms (e.g. stochastic subgradient, Gauss-Newton, and proximal point) drive the gradient of the Moreau envelope to zero at a controlled rate. Consequently, the estimate (1.6) provides a tight threshold beyond which such streaming algorithms on the regularized ERM problem (1.2) should terminate. The proof we present of (1.6) uses only the most elementary techniques: stability of quadratically regularized ERM , McDiarmid’s inequality , and a covering argument.
It is intriguing to ask when the dimension dependence in the bound (1.6) can be avoided. For example, for certain types of losses (e.g. modeling a linear predictor) there are well-known dimension independent bounds on uniform convergence in function values. Can we therefore obtain dimension independent bounds in similar circumstances, but on the deviations ? The main tool we use to address this question is entirely deterministic. We will show that if and are uniformly close, then the gradients and are uniformly close, as well as their subdifferential graphs in the Hausdorff distance. We illustrate the use of such bounds with two examples. As the first example, consider the loss modeling a generalized linear model:
Here is some feature map and is a loss function. It is well-known that if is Lipschitz, then the empirical function values converge uniformly to the population values at a dimension-independent rate that scales as in the sample size. We thus deduce that the gradient converges uniformly to at the rate . We leave it as an intriguing open question whether this rate can be improved to . The second example analyzes the landscape of a robust nonlinear regression problem, wherein we observe a series of nonlinear measurements of input data , possibly with adversarial corruption. Using the aforementioned techniques, we will show that under mild distributional assumptions on , every stationary point of the associated nonsmooth nonconvex empirical risk is within a small ball around .
Though, in the discussion above, we focused on the norm that is induced by an inner product, the techniques we present apply much more broadly to Bregman divergences. In particular, any Bregman divergence generates an associated conjugate regularization of the empirical and population risks, making our techniques applicable under non-euclidean geometries and under high order growth of the loss function. The outline of the paper is as follows. In Section 2, we introduce our notation as well as several key concepts including Bregman divergences and subdifferentials. In Section 3, we describe the problem setting. In Section 4, we obtain dimension dependent rates on the error between the gradients of the Moreau envelopes of the population and subsampled objectives. In Section 5, we illustrate the techniques of the previous section by obtaining dimension independent rates for generalized linear models and analyzing the landscape of a robust nonlinear regression problem.
This paper builds on the vast literature on sample average approximations found in the stochastic programming and statistical learning literature. The results in these communities are similar in many respects, but differ in their focus on convergence criteria. In the stochastic programming literature, much attention has been given to the convergence of (approximate) minimizers and optimal values both in the distributional and almost sure limiting sense [33, 63, 27, 64, 55, 51, 65, 32, 54]. In contrast, the statistical learning community puts a greater emphasis on excess risk bounds that hold with high probability, often with minimal or no dependence on dimension [62, 29, 68, 36, 41, 42, 12, 52, 73, 79, 69, 31].
Several previous works have studied (sub)gradient based convergence, as we do here. For example,  proves nonasymptotic, dimension dependent high probability bounds on the distance between the empirical and population subdifferential under the Hausdorff metric. The main assumption in this work, however, essentially requires smoothness of the population objective. The work  takes a different approach, directly smoothing the empirical losses . They show that the limit of the gradients of a certain smoothing of the empirical risk converges to an element of the population subdifferential. No finite-sample bounds are developed in . The most general asymptotic convergence result that we are aware of is presented in . There, the authors show that with probability one, the limit of a certain enlarged subdifferential of the empirical loss converges to an enlarged subdifferential of the population risk under the Hausdorff metric.
The two works most closely related to this paper are more recent. The paper  proves high probability uniform convergence of gradients for smooth objectives under the assumption that the gradient is sub-Gaussian with respect to the population data. The bounds presented in  are all dimension dependent and rely on covering arguments. The more recent paper , on the other hand, provides dimension independent high probability uniform rates of convergence of gradients for smooth Lipschitz generalized linear models. The main technical tool developed 
is a “chain rule” for Rademacher complexity. We note that, in contrast to therates developed in this paper,  obtains rates of the form for smooth generalized linear models.
Throughout, we follow standard notation from convex analysis, as set out for example by Rockafellar . The symbol will denote a -dimensional Euclidean space with inner product and the induced norm . Given any other norm , the symbol will denote the dual norm. For any set , we let and stand for the interior and closure of , respectively. The symbol will refer to the interior of relative to its affine hull. The closed Euclidean unit ball and the unit simplex in will be denoted by and , respectively. The effective domain of any function , denoted by , consists of all points where is finite. The indicator function of any set , denoted , is defined to be zero on and off it.
2.1 Bregman divergence.
In this work, we will use techniques based on the Bregman divergence, as is now standard in optimization and machine learning literature. For more details on the topic, we refer the reader to the expository articles [13, 30, 71]. To this end, henceforth, we fix a Legendre function , meaning:
(Convexity) is proper, closed, and strictly convex.
(Essential smoothness) The domain of has nonempty interior, is differentiable on and for any sequence converging to a boundary point of , it must be the case that .
Typical examples of Legendre functions are the squared Euclidean norm , convex polynomials , the Shannon entropy with , and the Burge function with . For more examples, see the articles [5, 9, 25, 70] or the recent survey .
The function induces the Bregman divergence defined by
for all . In the classical setting , the divergence reduces to the square deviation . Notice that since is strictly convex, equality holds for some if and only if . Therefore, the quantity measures the proximity between the two points . The divergence typically is asymmetric in and , and therefore we define the symmetrization
A function is called compatible with if the inclusion holds. Compatibility will be useful for guaranteeing that “proximal points” of induced by lie in ; see the forthcoming Theorem 3.1. Our focus will be primarily on those functions that can be convexified by adding a sufficiently large multiple of . Formally, we will say that a function is -weakly convex relative to , for some , if the perturbed function is convex. Similarly is -strongly convex relative to if the function is convex. The notions of relative weak/strong convexity are closely related to the generalized descent lemma introduced in  and the recent work on relative smoothness in [38, 37]. We postpone discussing examples of weakly convex functions to section 3.
First-order optimality conditions for nonsmooth and nonconvex problems are often most succinctly stated using subdifferentials. The subdifferential of a function at a point is denoted by
and consists of all vectorssatisfying
When is differentiable at , the subdifferential reduces to the singleton , while for convex functions it reduces to the subdifferential in the sense of convex analysis. We will call a point critical for if the inclusion holds.
When is -weakly convex relative to , the subdifferential automatically satisfies the seemingly stronger property [17, Lemma 2.2]:
for any , and any . It is often convenient to interpret the assignment as a set-valued map, and as such, it has a graph defined by
3 Problem setting.
Fix a probability space . In this paper, we focus on the optimization problem
under the following assumptions on the functional components:
(Bregman Divergence) The Legendre function satisfies the compatibility condition, Moreover is -strongly convex with respect to some norm on the set , meaning:
(Weak Convexity) The functions are -weakly convex relative to , for a.e. , and are bounded from below.
(Lipschitzian Property) There exists a square integrable function such that for all and , we have
The stochastic optimization problem (3.1) is the standard task of minimizing the regularized population risk. The function is called the loss, while is a structure promoting regularizer. Alternatively can encode feasibility constraints as an indicator function. Assumptions and are self-explanatory. Assumption asserts control on the variation in the loss . In the simplest setting when , Assumption simply amounts to Lipschitz continuity of the loss on with a square integrable Lipschitz constant . The use of the Bregman divergence allows much more flexibility, as highlighted in the recent works [8, 38, 37, 17]. On one hand, it allows one to consider losses that are Lipschitz in a non-Euclidean norm, as long as is strongly convex in that norm—a standard technique in machine learning and optimization. On the other hand, the Bregman divergence could also accommodate losses that are only locally Lipschitz continuous. The following two examples illustrate these two settings.
Example 3.1 (Non-Euclidean geometry: -setup).
In this example, we consider the instance of (3.1) that is adapted to the -norm . To this end, define the Legendre function with . Then is strongly convex with respect to the -norm with constant ; see e.g. [60, Lemma 17], [47, eqn. 5.5], [48, Corollary 2.2]. Then holds as long as the functions satisfy the estimate [17, Lemma 2.2]
for all , , and a.e. . Property , in turn, holds as long as the functions are Lipschitz continuous in the -norm, with a square integrable Lipschitz constant .
Example 3.2 (High-order growth).
As the second example, we consider the setting where the losses grow faster than linear. Namely, suppose that the Lipschitz constant of the loss on bounded sets is polynomially controlled:
where is some degree univariate polynomial with nonnegative coefficients and is square integrable. Then the result [17, Proposition 3.2] shows that (A3) holds for the Legendre function
We could in principle state and prove all the results of the paper in the Euclidean setting . On the other hand, the use of the Bregman divergence adds no complications whatsoever, and therefore we work in this greater generality. The reader should keep in mind, however, that all the results are of interest even in the Euclidean setting.
3.1 Convex compositions.
The most important example of the problem class (3.1) corresponds to the setting when is convex and the loss has the form:
where is convex and is -smooth. To see this, let us first consider the setting . Then provided that is -Lipschitz and the Jacobian is -Lipschitz, a quick argument [22, Lemma 4.2] shows that the loss is -weakly convex relative to ; therefore (A2) holds with . Moreover, if there exists a square integrable function satisfying for all and , then (A3) holds with .
The assumption that is Lipschitz continuous is mild and is often true in applications. On the other hand, Lipschitz continuity and boundedness of are strong assumptions. They can be relaxed by switching to a different Bregman divergence. Indeed, suppose that is convex and -Lipschitz as before, while now satisfies the two growth properties:
for some polynomials and , with , and some square integrable function . Define the Legendre function
Then the result [17, Propositions 3.4, 3.5] shows that assumption (A2) and (A3) hold with and .
The class of composite problems is broad and has attracted some attention lately [35, 19, 24, 20, 22, 23, 15, 78, 17, 18] as an appealing setting for nonsmooth and nonconvex optimization. Table 1 summarizes a few interesting problems of this type; details can be found in the aforementioned works.
|Censored block model||—|
|Sparse dictionary learning|
3.2 The stationarity measure and implicit smoothing.
Since the problem (3.1) is nonconvex and nonsmooth, typical algorithms can only be guaranteed to find critical points of the problem, meaning those satisfying . Therefore, one of our main goals is to estimate the Hausdorff distance between the subdifferential graphs, and . We employ an indirect strategy based on a smoothing technique.
Setting the formalism, for any function , we define the -envelope and the -proximal map:
respectively. Note that in the Euclidean setting , these two constructions reduce to the widely used Moreau envelope and the proximity map introduced in . In this case, we will omit the subscript from and . Note that in the Euclidean setting, the graphs of the proximal map and the subdifferential are closely related:
Consequently the graph of the proximal map is linearly isomorphic to the graph of the subdifferential by the linear map . It is this observation that will allow us to pass from uniform estimates on the deviations to estimates on the Hausdorff distance between subdifferential graphs, and .
It will be important to know when the set is a singleton lying in . The following theorem follows quickly from , with a self-contained argument given in [17, Theorem 4.1]. The Euclidean version of the result was already proved in . We will often appeal to this theorem without explicitly referencing it to shorten the exposition.
Theorem 3.1 (Smoothness of the -envelope).
Consider a closed and lower-bounded function that is -weakly convex and compatible with some Legendre function . Then the following are true for any positive and any .
The proximal point is uniquely defined and lies in .
If is twice differentiable on the interior of its domain, then the envelope is differentiable at with gradient given by
The main application of Theorem 3.1 is to the functions and under the assumptions (A1)-(A3). Looking at the expression (3.3), it is clear that the deviation between and provides an estimate on the size of the gradient . To make this observation precise, for any point , define the primal-dual pair of local norms:
Thus for any positive and , using (3.3), we obtain the estimate
Consequently, following the recent literature on the subject, we will treat the quantity
as a measure of approximate stationarity of at . In particular, when specializing to the target setting , it is this quantity (3.4) that streaming algorithms drive to zero at a controlled rate, as shown in [17, 15, 78].
4 Dimension Dependent Rates.
In this section, we prove the uniform convergence bound (1.6), appropriately generalized using the Bregman divergence. The proof outline is as follows. First, in Theorem 4.1 we will estimate the expected error between the true proximal point and an empirical sample,
where is fixed. The same theorem also shows is stable in the sense that does not vary much when one sample is changed arbitrarily. In the Euclidean setting, this is precisely the main result of  on stability of quadratically regularized ERM in stochastic convex optimization. Using McDiarmid’s inequality  in Theorem 4.4, we will then deduce that the quantity concentrates around its mean for a fixed . A covering argument over the points in an appropriate metric will then complete the proof.
We begin following the outlined strategy with the following theorem.
Theorem 4.1 (Stability of regularized ERM).
Consider a set and define , where both the index and the point are arbitrary. Fix an arbitrary point and a real , and set
Then the estimates hold:
We first verify (4.1). A quick computation yields for any points and the equation:
Define now the regularized functions
Then adding to both sides of (4.4), we obtain
Henceforth, set and Thus is the minimizer of and is the minimizer of . Taking into account that and are -strongly convex relative to , we deduce
Rearranging, we arrive at the estimate
Dividing through by , we obtain the claimed stability guarantee (4.1).
To establish (4.3), observe first
denotes the discrete uniform distribution. Taking into account (4.1), we obtain
Noting and yields the right-hand-side of (4.3).
Next, we pass to high probability bounds on the deviation by means of McDiarmid’s inequality . The basic result reads as follows. Suppose that a function satisfies the bounded difference property:
for all , where
are some constants. Then for any independent random variables, the random variable satisfies:
A direct application of this inequality to using (4.1) would require the Lipschitz constant to be globally bounded. This could be a strong assumption, as it essentially requires the population data to be bounded. We will circumvent this difficulty by extending the McDiarmid’s inequality to the setting where the constants are replaced by some functions of the data, and . Let be a Rademacher random variable, meaning a random random variable taking value with equal probability. Then as long as the symmetric random variables have sufficiently light tails, a McDiarmid type bound will hold. In particular, we will be able to derive high probability upper bounds on the deviations