Stochastic minimization with exponentially concave (or exp-concave for short) loss functions finds many applications in machine learning, e.g., linear regression, logistic regression, support vector machines with the squared hinge loss, and portfolio optimization. There are two popular approaches for stochastic optimization. The first approach is called Sample Average Approximation (also known as empirical risk minimization in machine learning vapnik-1998-statistical ), in which a set of i.i.d. examples is drawn from the underlying distribution and an empirical risk minimization problem is solved. The second approach is Stochastic Approximation SA:Springer (closely related to online optimization), which iteratively learns the model from randomly sampled examples. Compared with stochastic approximation, empirical risk minimization is deemed more general and usually achieves better performance in practice. Importantly, it is amenable to any optimization algorithm.
Fast rates of optimization with exp-concave functions in the online or stochastic setting have attracted a large body of studies. In the seminal paper by Hazan et al. ML:Hazan:2007 , the authors proposed the online Newton step (ONS) algorithm (reminiscent of the Newton-Raphson method for offline optimization), which achieves an $O(\frac{d}{\alpha}\log T)$ regret bound, with $d$ being the dimension of the problem and $T$ being the total number of iterations. With the standard trick of online-to-batch conversion TIT04:Bianchi ; DBLP:conf/nips/KakadeT08 , one can obtain a fast convergence rate of $O(\frac{d\log n}{n})$ in the stochastic setting, as recently achieved in arXiv:1605.01288 ; DBLP:conf/colt/Mahdavi0J15 . However, the computational cost of ONS scales badly with the dimensionality of the problem (with at least a $d^2$ factor per iteration) DBLP:conf/nips/KorenL15 , which may prohibit its application to high-dimensional problems.
In terms of empirical risk minimization (ERM), it was not until recently that fast rates for exp-concave risk minimization were established. Koren & Levy DBLP:conf/nips/KorenL15 obtained the first result showing that ERM is able to attain a fast generalization rate for exp-concave risk minimization, i.e., an $O(d/n)$ expected convergence bound on the difference between the risk of the learned model and the risk of the optimal model. Strictly speaking, their guarantee is not for the solution to ERM but for the solution to a penalized ERM obtained by adding a strongly convex regularizer to the ERM objective. Gonen and Shalev-Shwartz DBLP:journals/corr/abs/1601.04011 derived a similar in-expectation fast rate for supervised learning with exp-concave losses. Recently, Mehta arXiv:1605.01288 established high-probability fast rates for exp-concave empirical risk minimization, which are worse only by a logarithmic factor than the in-expectation rate.
This paper is motivated by solving the following stochastic composite optimization problem:
$$\min_{\mathbf{w}\in\mathcal{W}} F(\mathbf{w}) := \mathbb{E}_{\xi}[f(\mathbf{w};\xi)] + r(\mathbf{w}), \qquad (1)$$
where the objective consists of a stochastic component $\mathbb{E}_{\xi}[f(\mathbf{w};\xi)]$ that is the expectation over a random function and a deterministic component $r(\mathbf{w})$. In this paper, we will assume: (i) $\mathcal{W}\subseteq\mathbb{R}^d$ is a compact and convex set; (ii) $f(\mathbf{w};\xi)$ is a smooth and $\alpha$-exp-concave function of $\mathbf{w}$ for any $\xi$, and is Lipschitz continuous over the bounded domain $\mathcal{W}$. To keep the setting general, we do not impose any strong convexity, exp-concavity, or smoothness assumption on $r(\mathbf{w})$ beyond convexity.
We study the convergence of the empirical minimizer of (1), i.e.,
$$\widehat{\mathbf{w}}_n = \arg\min_{\mathbf{w}\in\mathcal{W}} \frac{1}{n}\sum_{i=1}^n f(\mathbf{w};\xi_i) + r(\mathbf{w}), \qquad (2)$$
where $\xi_1,\ldots,\xi_n$ are i.i.d. samples from the underlying distribution. Our major goal here is to establish a fast convergence rate of the empirical minimizer in terms of $F(\widehat{\mathbf{w}}_n) - \min_{\mathbf{w}\in\mathcal{W}}F(\mathbf{w})$. This is in contrast to many previous works focusing on the convergence analysis of stochastic approximation algorithms Lan:2010:Optimal for solving (1). Notably, many efficient optimization algorithms are available for solving (2) xiao2014proximal ; DBLP:conf/nips/DefazioBL14 . In machine learning applications, the deterministic component $r(\mathbf{w})$ is usually a regularizer that enforces some kind of structure over the model $\mathbf{w}$. Many studies in machine learning and statistics have found that a regularizer that incorporates prior knowledge about the model can lead to great improvements in performance.
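As a purely illustrative instance of the regularized empirical minimizer in (2) (not part of the paper's analysis), the sketch below minimizes an empirical objective combining the square loss with an $\ell_1$ regularizer via proximal gradient descent; the synthetic data, step size, and regularization weight are all hypothetical choices.

```python
import random

random.seed(0)

d, n, lam, eta = 3, 200, 0.1, 0.05  # dimension, samples, l1 weight, step size

# Synthetic data: y_i = <w_true, x_i> + small Gaussian noise.
w_true = [1.0, -0.5, 0.0]
data = []
for _ in range(n):
    x = [random.uniform(-1, 1) for _ in range(d)]
    y = sum(a * b for a, b in zip(w_true, x)) + 0.1 * random.gauss(0, 1)
    data.append((x, y))

def objective(w):
    # Empirical risk (square loss) plus the l1 regularizer.
    risk = sum((sum(a * b for a, b in zip(w, x)) - y) ** 2 for x, y in data) / n
    return risk + lam * sum(abs(c) for c in w)

def prox_grad_step(w):
    # Gradient step on the smooth part, then soft-thresholding (prox of l1).
    g = [0.0] * d
    for x, y in data:
        r = sum(a * b for a, b in zip(w, x)) - y
        for j in range(d):
            g[j] += 2 * r * x[j] / n
    z = [w[j] - eta * g[j] for j in range(d)]
    return [max(abs(c) - eta * lam, 0.0) * (1 if c > 0 else -1) for c in z]

w = [0.0] * d
for _ in range(500):
    w = prox_grad_step(w)
```

After the iterations, `objective(w)` is substantially smaller than the objective at the origin, and `w` is a shrunken estimate of `w_true`.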
To establish the convergence rate of the empirical minimizer (2), one may consider defining a new loss $g(\mathbf{w};\xi) = f(\mathbf{w};\xi) + r(\mathbf{w})$, so that the problem takes the unregularized form, and then leverage the existing theory to prove the convergence rate. However, the combined function is not necessarily exp-concave (see Example 2 below). Therefore, the previous fast-rate analyses for exp-concave empirical risk minimization do not carry over to the considered problem. As a consequence, the standard generalization theory of ERM COLT:Shalev:2009 applied to (2) can only guarantee an $O(1/\sqrt{n})$ convergence rate, which is worse than the $O(d\log n/n)$ fast rate that we aim to establish.
Contributions. The main contribution of this work is a simple analysis establishing a fast rate of $O(d\log n/n)$ with high probability for the empirical minimizer (2) with $\alpha$-exp-concave losses and an arbitrary convex regularizer, measured in terms of $F(\cdot)$. Our proof is simple and elementary, utilizing only the covering number of $\mathcal{W}$ and a concentration inequality for random vectors.
2 Comparison with Previous Works
There are extensive studies on fast rates of ERM. Due to space limitations, the review below focuses on closely related work. The three recent studies DBLP:conf/nips/KorenL15 ; DBLP:journals/corr/abs/1601.04011 ; arXiv:1605.01288 focus on establishing fast rates for risk minimization without a regularizer, i.e.,
$$\min_{\mathbf{w}\in\mathcal{W}} \mathbb{E}_{\xi}[f(\mathbf{w};\xi)], \qquad (3)$$
where $f(\mathbf{w};\xi)$ is an $\alpha$-exp-concave function of $\mathbf{w}$. As reasoned above, the fast rates in these studies do not carry over to the minimization problem (1) with an arbitrary regularizer.
Koren & Levy DBLP:conf/nips/KorenL15 studied the convergence of a penalized/regularized empirical risk minimizer
$$\widehat{\mathbf{w}}_n = \arg\min_{\mathbf{w}\in\mathcal{W}} \frac{1}{n}\sum_{i=1}^n f(\mathbf{w};\xi_i) + \lambda R(\mathbf{w}). \qquad (4)$$
They assumed that $R(\mathbf{w})$ is a strongly convex regularizer w.r.t. the Euclidean norm and is bounded over $\mathcal{W}$. Gonen and Shalev-Shwartz DBLP:journals/corr/abs/1601.04011 focused on risk minimization with a generalized linear model:
$$\min_{\mathbf{w}\in\mathcal{W}} \mathbb{E}_{(\mathbf{x},y)}[\phi(\mathbf{w}^\top\mathbf{x}; y)],$$
which is a special case of the general minimization problem (1). Under the assumption that $\phi$ is an $\alpha$-exp-concave function of $\mathbf{w}^\top\mathbf{x}$, they provided an expected convergence rate for the empirical risk minimizer that is of the same order as the result in DBLP:conf/nips/KorenL15 , i.e., $O(d/n)$.
There are three differences between our work and these two works: (i) their fast-rate results are with respect to the risk $\mathbb{E}_{\xi}[f(\mathbf{w};\xi)]$, which does not include any regularizer; in contrast, our result is with respect to $F(\mathbf{w})$; (ii) the strongly convex penalization term in DBLP:conf/nips/KorenL15 is artificially added to the ERM objective to facilitate the analysis; in contrast, the arbitrary convex regularizer in this paper is built into the objective; (iii) their fast-rate guarantees are in expectation, while our fast-rate guarantee holds with high probability. In light of these differences, our result is more general and much stronger. In particular, when setting $r(\mathbf{w})=0$ in our problem, we obtain a high-probability fast rate for the empirical risk minimizer, which is worse only by a factor of $\log n$ than the in-expectation rate in DBLP:journals/corr/abs/1601.04011 ; DBLP:conf/nips/KorenL15 . Additionally, a similar high-probability risk bound with respect to (3) for any regularized empirical risk minimizer (4) can be easily derived in our framework as long as $R(\mathbf{w})$ is convex and bounded over $\mathcal{W}$ (see Theorem 2).
A more recent work by Mehta arXiv:1605.01288 establishes a high-probability fast rate for exp-concave ERM. His analysis of the empirical risk minimizer is based on the connection between exp-concavity and the stochastic mixability condition, and he exploited the heavy machinery developed in earlier work on fast-rate analysis of empirical risk minimization under the stochastic mixability condition DBLP:conf/nips/MehtaW14 . However, this analysis does not apply to the regularized empirical risk minimizer (4) with an arbitrary convex regularizer. Admittedly, arXiv:1605.01288 made the additional contribution of removing the logarithmic factor by using boosting techniques to boost the in-expectation results of DBLP:conf/nips/KorenL15 ; DBLP:journals/corr/abs/1601.04011 .
We comment on the extra conditions on the loss functions. DBLP:conf/nips/KorenL15 requires that $f(\mathbf{w};\xi)$ be a smooth function of $\mathbf{w}$ for any $\xi$ and be bounded over $\mathcal{W}$. Both DBLP:journals/corr/abs/1601.04011 and arXiv:1605.01288 require the loss function to be Lipschitz continuous. Note that Lipschitz continuity over a bounded domain implies a bounded range of the loss function. In the present paper, we assume that $f(\mathbf{w};\xi)$ is Lipschitz continuous and smooth over $\mathcal{W}$ for all $\xi$. Both conditions are necessary for us to deliver a simple analysis of exp-concave empirical minimization with an arbitrary convex regularizer. We also note that for a twice-differentiable, smooth, and exp-concave function, Lipschitz continuity is automatically satisfied (see Remark 1).
Next, we briefly mention several results regarding fast rates of ERM under the strong convexity condition, which is stronger than exp-concavity. Shalev-Shwartz et al. COLT:Shalev:2009 established an in-expectation convergence bound of ERM over any bounded convex set for (3), which requires each individual loss function to be a $\lambda$-strongly convex function of $\mathbf{w}$. However, in machine learning applications, individual loss functions are usually not strongly convex. Recently, Zhang et al. DBLP:journals/corr/0005YJ17 developed optimistic rates of ERM over a bounded convex set for (3), where they assumed the loss to be non-negative and smooth. Under a $\lambda$-strong convexity assumption on the risk, they established a fast rate of $O(1/n)$ with high probability and a faster rate of $o(1/n)$ when the optimal risk is small and the number of samples is sufficiently large. In NIPS2008_3400 , the authors considered the composite problem (1) with $f$ having a generalized linear form and established a high-probability fast rate of $O(1/(\lambda n))$ for a $\lambda$-strongly convex objective.
We can also compare with stochastic approximation algorithms. Lan Lan:2010:Optimal presented an optimal method for solving (1) without the exp-concavity assumption, which employs a proximal mapping to handle $r(\mathbf{w})$ and has a convergence rate of $O(\sigma/\sqrt{n})$, where $\sigma$ is related to the noise in the stochastic gradient. In contrast, the convergence rate of the empirical minimizer shown in this work has a better dependence on $n$. One may also apply the online-to-batch conversion to a variant of ONS that employs a proximal mapping to handle $r(\mathbf{w})$ and obtain an $O(d\log n/n)$ convergence rate with high probability. Nonetheless, the resulting algorithm would be at least as expensive as ONS. Finally, it is worth mentioning that the linear dependence of the convergence rate of the empirical minimizer for (1) on the dimensionality $d$ is unavoidable even for smooth functions NIPS2016_ERM .
3 Preliminaries
In this section, we present some preliminaries. Let $\xi$ denote a random variable following a distribution $\mathbb{P}$. Denote by $\nabla f(\mathbf{w};\xi)$ the partial gradient of $f$ in terms of $\mathbf{w}$. Define
$$F(\mathbf{w}) = \mathbb{E}_{\xi}[f(\mathbf{w};\xi)] + r(\mathbf{w}), \qquad \widehat{F}_n(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^n f(\mathbf{w};\xi_i) + r(\mathbf{w}).$$
Let $\|\cdot\|$ denote the Euclidean norm of a vector. For a positive definite matrix $H$, define the $H$-norm $\|\mathbf{w}\|_{H} = \sqrt{\mathbf{w}^\top H\mathbf{w}}$ and its dual norm $\|\mathbf{w}\|_{H^{-1}} = \sqrt{\mathbf{w}^\top H^{-1}\mathbf{w}}$. By Hölder's inequality, we have $|\mathbf{u}^\top\mathbf{v}| \le \|\mathbf{u}\|_{H}\,\|\mathbf{v}\|_{H^{-1}}$.
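This matrix Hölder inequality follows in one line from the Cauchy-Schwarz inequality after a change of variable by $H^{1/2}$ (a standard argument, included here for completeness):

```latex
|\mathbf{u}^\top \mathbf{v}|
  = \big| (H^{1/2}\mathbf{u})^\top (H^{-1/2}\mathbf{v}) \big|
  \le \|H^{1/2}\mathbf{u}\| \, \|H^{-1/2}\mathbf{v}\|
  = \|\mathbf{u}\|_{H} \, \|\mathbf{v}\|_{H^{-1}} .
```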
A function $f:\mathcal{W}\to\mathbb{R}$ is $\alpha$-exp-concave over the domain $\mathcal{W}$ for some $\alpha>0$ if the function $\exp(-\alpha f(\mathbf{w}))$ is concave over $\mathcal{W}$. If $f$ is twice differentiable and $\alpha$-exp-concave, it follows that
$$\nabla^2 f(\mathbf{w}) \succeq \alpha \nabla f(\mathbf{w})\nabla f(\mathbf{w})^\top. \qquad (7)$$
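For completeness, this second-order condition can be derived by differentiating $h(\mathbf{w}) = \exp(-\alpha f(\mathbf{w}))$ twice; concavity of $h$ is equivalent to $\nabla^2 h \preceq 0$:

```latex
\nabla^2 h(\mathbf{w})
  = \alpha e^{-\alpha f(\mathbf{w})}
    \Big( \alpha \nabla f(\mathbf{w}) \nabla f(\mathbf{w})^\top - \nabla^2 f(\mathbf{w}) \Big)
  \preceq 0
  \quad\Longleftrightarrow\quad
  \nabla^2 f(\mathbf{w}) \succeq \alpha \nabla f(\mathbf{w}) \nabla f(\mathbf{w})^\top .
```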
A function $f$ is $\beta$-smooth with respect to $\|\cdot\|$ over $\mathcal{W}$ if for all $\mathbf{u},\mathbf{v}\in\mathcal{W}$,
$$\|\nabla f(\mathbf{u}) - \nabla f(\mathbf{v})\| \le \beta\|\mathbf{u}-\mathbf{v}\|.$$
A function $f$ is $G$-Lipschitz continuous if $|f(\mathbf{u})-f(\mathbf{v})| \le G\|\mathbf{u}-\mathbf{v}\|$ for all $\mathbf{u},\mathbf{v}\in\mathcal{W}$.
We will make the following assumptions regarding the loss function and the regularizer .
Assumption 1. We assume that (i) $\mathcal{W}\subset\mathbb{R}^d$ is a closed and bounded convex set, i.e., there exists $R>0$ such that $\|\mathbf{w}\|\le R$ for all $\mathbf{w}\in\mathcal{W}$; (ii) $f(\mathbf{w};\xi)$ is a $G$-Lipschitz continuous, $\beta$-smooth, and $\alpha$-exp-concave function of $\mathbf{w}$ for any $\xi$; (iii) $r(\mathbf{w})$ is a convex function.
Remark 1: Note that if $f(\mathbf{w};\xi)$ is twice differentiable, smoothness and exp-concavity naturally imply Lipschitz continuity. This can be seen from (7) by noting that $\nabla^2 f(\mathbf{w};\xi) \preceq \beta I$. As a result, $\alpha\|\nabla f(\mathbf{w};\xi)\|^2 \le \beta$, which implies that $f(\mathbf{w};\xi)$ is $\sqrt{\beta/\alpha}$-Lipschitz continuous.
There are many machine learning problems satisfying the above assumptions. If we consider a loss function $f(\mathbf{w};\mathbf{x},y)$ in supervised learning, where $(\mathbf{x},y)$ denotes a random feature vector and its label, then the square loss, the logistic loss, and the squared hinge loss are exp-concave functions under appropriate conditions on the data and $\mathcal{W}$. Let us consider the square loss as an example.
Example 1. Consider the square loss $f(\mathbf{w};\mathbf{x},y) = (\mathbf{w}^\top\mathbf{x}-y)^2$ and suppose $\mathbf{x}$ and $y$ are bounded. W.l.o.g., we can assume $\|\mathbf{x}\|\le 1$, $|y|\le 1$ and $\|\mathbf{w}\|\le 1$, so that $|\mathbf{w}^\top\mathbf{x}-y|\le 2$. Then $\nabla f(\mathbf{w};\mathbf{x},y) = 2(\mathbf{w}^\top\mathbf{x}-y)\mathbf{x}$ and $\nabla^2 f(\mathbf{w};\mathbf{x},y) = 2\mathbf{x}\mathbf{x}^\top$. It then follows that for any $\mathbf{w}\in\mathcal{W}$ and any $(\mathbf{x},y)$,
$$\nabla^2 f(\mathbf{w};\mathbf{x},y) - \frac{1}{8}\nabla f(\mathbf{w};\mathbf{x},y)\nabla f(\mathbf{w};\mathbf{x},y)^\top = \Big(2 - \frac{(\mathbf{w}^\top\mathbf{x}-y)^2}{2}\Big)\mathbf{x}\mathbf{x}^\top \succeq 0,$$
which guarantees that $f(\mathbf{w};\mathbf{x},y)$ is a $\frac{1}{8}$-exp-concave function of $\mathbf{w}$.
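A quick numerical sanity check of this example (under the assumed bound $|\mathbf{w}^\top\mathbf{x}-y|\le 2$): for the square loss, both the Hessian and the gradient outer product are multiples of $\mathbf{x}\mathbf{x}^\top$, so the exp-concavity condition $\nabla^2 f \succeq \alpha\nabla f\nabla f^\top$ reduces to nonnegativity of a scalar coefficient.

```python
import random

random.seed(1)
alpha = 1.0 / 8.0

def min_coeff(trials=1000):
    # For the square loss f(w) = (w.x - y)^2 with residual r = w.x - y:
    #   Hessian              = 2 x x^T
    #   grad outer product   = 4 r^2 x x^T
    # so H - alpha * g g^T = (2 - 4*alpha*r^2) x x^T.  Since x x^T is PSD,
    # exp-concavity holds iff the scalar 2 - 4*alpha*r^2 is nonnegative.
    worst = float("inf")
    for _ in range(trials):
        r = random.uniform(-2, 2)  # residual bounded by 2 by assumption
        worst = min(worst, 2 - 4 * alpha * r * r)
    return worst
```

The coefficient vanishes exactly at the boundary residual $|r|=2$, confirming that $\alpha = 1/8$ is the largest constant compatible with these bounds.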
Next, we give an example showing that the sum of an exp-concave function and a convex function is not necessarily an exp-concave function.
Example 2. Consider an exp-concave function $f$, a convex function $r$, and their sum over a bounded domain $\mathcal{W}$. To see that $f$ is exp-concave, one can compute $\nabla f$ and $\nabla^2 f$ and verify (7) for any $\mathbf{w}\in\mathcal{W}$ and some $\alpha>0$. To see that the sum $f+r$ is not exp-concave, one can exhibit a direction in which, for any $\alpha>0$, the matrix $\nabla^2 (f+r) - \alpha\nabla (f+r)\nabla (f+r)^\top$ is not positive semi-definite, which contradicts (7) if the sum were exp-concave. As a result, the sum of an exp-concave function and a convex function is not necessarily an exp-concave function.
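A concrete instance of this phenomenon (a hypothetical choice for illustration, not necessarily the exact pair in Example 2): $f(\mathbf{w}) = w_1^2$ is exp-concave over any bounded domain, $r(\mathbf{w}) = w_2$ is convex, yet the sum $g = f + r$ violates the Hessian condition (7) in the direction $\mathbf{e}_2$ for every $\alpha > 0$.

```python
def quad_form_gap(w1, w2, alpha):
    # g(w) = w1^2 + w2:  Hessian = diag(2, 0),  gradient = (2*w1, 1).
    # Evaluate v^T (Hessian - alpha * grad grad^T) v for v = e2 = (0, 1):
    # the Hessian contributes 0, the outer product contributes alpha * 1.
    grad = (2 * w1, 1.0)
    hess = ((2.0, 0.0), (0.0, 0.0))
    v = (0.0, 1.0)
    hv = sum(v[i] * hess[i][j] * v[j] for i in range(2) for j in range(2))
    gv = sum(grad[i] * v[i] for i in range(2))
    return hv - alpha * gv * gv  # equals -alpha, negative for every alpha > 0
```

The gap equals $-\alpha$ at every point, so no choice of $\alpha > 0$ makes the sum exp-concave.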
In our analysis, we will use the covering number of $\mathcal{W}$. A subset $\mathcal{N}\subseteq\mathcal{W}$ is called an $\epsilon$-net of $\mathcal{W}$ if for every $\mathbf{w}\in\mathcal{W}$ one can find $\mathbf{w}'\in\mathcal{N}$ so that $\|\mathbf{w}-\mathbf{w}'\|\le\epsilon$. The minimal cardinality of an $\epsilon$-net of $\mathcal{W}$ is called the covering number and is denoted by $N(\mathcal{W},\epsilon)$. The covering number of the Euclidean ball $\mathcal{B}_2^d(R)$ can be estimated using a standard volume comparison argument Convex:body:89 as follows: $N(\mathcal{B}_2^d(R),\epsilon) \le (1+2R/\epsilon)^d$. The covering numbers are (almost) increasing by inclusion OneBit:Plan:LP : $\mathcal{W}\subseteq\mathcal{B}_2^d(R)$ implies $N(\mathcal{W},2\epsilon) \le N(\mathcal{B}_2^d(R),\epsilon)$. Since $\mathcal{W}\subseteq\mathcal{B}_2^d(R)$ under Assumption 1, we then have $N(\mathcal{W},\epsilon) \le (1+4R/\epsilon)^d$.
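The volume-based bound can be illustrated empirically (a toy construction, not the proof): greedily keeping $\epsilon$-separated points from a dense sample of the disk yields an $\epsilon$-net of the sampled points whose size stays below the packing bound $(1+2R/\epsilon)^d$.

```python
import math
import random

random.seed(2)
R, eps, d = 1.0, 0.5, 2

# Dense candidate sample from the Euclidean ball of radius R in R^2.
cands = []
while len(cands) < 20000:
    p = (random.uniform(-R, R), random.uniform(-R, R))
    if p[0] ** 2 + p[1] ** 2 <= R ** 2:
        cands.append(p)

# Greedy construction of an eps-separated set: every discarded candidate
# lies within eps of some kept point, so the kept set is an eps-net of
# the candidate sample.
net = []
for p in cands:
    if all(math.dist(p, q) >= eps for q in net):
        net.append(p)

# Disjoint eps/2-balls around net points fit inside a ball of radius R + eps/2,
# which gives the packing bound below.
packing_bound = (1 + 2 * R / eps) ** d
```

By construction every candidate point is within $\epsilon$ of the net, and the separation argument caps the net size at the volume bound.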
Finally, we present two basic lemmas, which will be useful in our analysis.
Let $\mathbf{w}^*$ be an optimal solution to (1). Then for any $\mathbf{w}\in\mathcal{W}$ we have
4 Main Result and Analysis
Remark 2: Note that when $r(\mathbf{w})=0$, we directly obtain a high-probability fast rate for the empirical risk minimizer of the exp-concave risk minimization problem (3). We can also obtain a similar result for (3) regarding the regularized empirical risk minimizer (4), which provides a different way of solving (3) that is usually preferred over solving the empirical risk minimization problem without any regularization, because (i) regularization can lead to better conditioning from the perspective of optimization complexity; and (ii) prior knowledge about the model can be encoded into the regularizer.
Remark 3: This new result not only addresses the open problem raised in DBLP:conf/nips/KorenL15 about a high-probability bound for the strongly regularized empirical risk minimizer, but also extends the fast rate to any regularized empirical risk minimizer as long as the regularizer is convex. In comparison, DBLP:conf/nips/KorenL15 only provides an in-expectation fast rate for the regularized empirical risk minimizer with a strongly convex regularizer. The additional assumption used in our analysis compared to DBLP:conf/nips/KorenL15 is the Lipschitz continuity of the loss functions over the domain $\mathcal{W}$, which is mild.
Suppose Assumption 1 holds. For any $\mathbf{w}\in\mathcal{W}$ we have
Let . We begin with the following inequality in Lemma 1
Taking expectation on both sides over the random variable $\xi$, we have
Adding up the above inequality and the inequality in Lemma 2, we have
Adding the corresponding regularizer terms on both sides and using the definition of $F(\mathbf{w})$, we finish the proof.
Under Assumption 1, with probability at least $1-\delta$, for any $\mathbf{w}\in\mathcal{W}$, we have
To prove the above lemma, we need the following concentration result for random vectors.
Proposition 1 ( Smale:learning ). Let $\mathcal{H}$ be a Hilbert space equipped with a norm $\|\cdot\|$ and let $\xi$ be a random variable with values in $\mathcal{H}$. Assume $\|\xi\|\le M<\infty$ almost surely. Denote $\sigma^2(\xi)=\mathbb{E}[\|\xi\|^2]$. Let $\{\xi_i\}_{i=1}^n$ be $n$ independent draws of $\xi$. For any $0<\delta<1$, with confidence $1-\delta$,
$$\left\|\frac{1}{n}\sum_{i=1}^n \xi_i - \mathbb{E}[\xi]\right\| \le \frac{2M\log(2/\delta)}{n} + \sqrt{\frac{2\sigma^2(\xi)\log(2/\delta)}{n}}.$$
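A Monte-Carlo illustration of a concentration bound of this shape, namely $\frac{2M\log(2/\delta)}{n} + \sqrt{\frac{2\sigma^2\log(2/\delta)}{n}}$, using synthetic bounded random vectors (uniform on the unit sphere, so $M = \sigma^2 = 1$ and the mean is zero):

```python
import math
import random

random.seed(3)
n, d, delta, M = 1000, 3, 0.01, 1.0

def draw():
    # A random vector uniform on the unit sphere in R^3 (so ||xi|| = 1 a.s.).
    v = [random.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

sigma2 = 1.0  # E||xi||^2 = 1 for unit-sphere vectors
bound = (2 * M * math.log(2 / delta) / n
         + math.sqrt(2 * sigma2 * math.log(2 / delta) / n))

def deviation():
    # ||(1/n) sum_i xi_i - E[xi]|| with E[xi] = 0 by symmetry.
    s = [0.0] * d
    for _ in range(n):
        x = draw()
        for j in range(d):
            s[j] += x[j] / n
    return math.sqrt(sum(c * c for c in s))

max_dev = max(deviation() for _ in range(50))
```

Across repeated trials the empirical deviation stays well below the theoretical bound, as expected since the bound holds with confidence $1-\delta$.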
To utilize the above proposition, we consider $\nabla f(\mathbf{w};\xi)$ as a random variable in the Hilbert space $\mathbb{R}^d$ equipped with the Euclidean norm $\|\cdot\|$. To prove Lemma 4, we need upper bounds on $M$ and $\sigma^2$. First, we note that $f(\mathbf{w};\xi)$ is $G$-Lipschitz continuous, and then $\|\nabla f(\mathbf{w};\xi)\|\le G$. Second,
where $\mathrm{tr}(\cdot)$ denotes the trace function and the last inequality uses the bound on the gradient norm. Then, according to Proposition 1, with probability at least $1-\delta$, we have
Under Assumption 1, with probability at least $1-\delta$, for any $\mathbf{w}\in\mathcal{W}$ and any $\epsilon>0$, we have
The proof of the above lemma is similar to the proof of Lemma 1 in DBLP:journals/corr/0005YJ17 and is deferred to the supplement. The idea of the proof is as follows: first, we establish an upper bound for a fixed $\mathbf{w}$ using Proposition 1; then we use the union bound and the covering number of $\mathcal{W}$ to obtain an upper bound for every point of an $\epsilon$-net; finally, we utilize the property of the $\epsilon$-net to prove the inequality in the lemma for any $\mathbf{w}\in\mathcal{W}$.
4.1 Proof of Theorem 1
Let $\epsilon>0$ be the radius of an $\epsilon$-net of $\mathcal{W}$. The values of the free parameters will be decided later.
where the first inequality uses the convexity of $r(\mathbf{w})$ and the second inequality uses the optimality condition of the empirical minimizer for (2). Then we have
Next, we bound the last four terms on the R.H.S. using Hölder's inequality.
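Each of these cross terms has the form $\mathbf{u}^\top\mathbf{v}$ and is handled by a step of the following generic shape, combining the matrix Hölder inequality from the preliminaries with Young's inequality (the matrix $H$ and the constant $c$ are chosen separately for each term in the proof):

```latex
\mathbf{u}^\top \mathbf{v}
  \le \|\mathbf{u}\|_{H^{-1}} \, \|\mathbf{v}\|_{H}
  \le \frac{1}{2c}\,\|\mathbf{u}\|_{H^{-1}}^2 + \frac{c}{2}\,\|\mathbf{v}\|_{H}^2 ,
  \qquad c > 0 .
```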
4.2 Proof of Theorem 2
where the quantities are defined as above. According to Theorem 1, the following inequality holds with high probability:
Plugging in the definitions above, we have
where the last inequality uses the assumption of the theorem.
In this paper, we have developed a simple analysis of fast rates for empirical minimization with exponentially concave loss functions and an arbitrary convex regularizer, which represents the first result of its kind. The proof is elementary, exploiting only the covering number of a finite-dimensional bounded set and a concentration inequality for random vectors. Our framework also yields unified fast-rate results for exp-concave empirical risk minimization with and without a convex regularizer. An open problem that remains is whether the logarithmic factor can be removed without using the boosting technique.
-  N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
-  A. Defazio, F. R. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), pages 1646–1654, 2014.
-  V. Feldman. Generalization of ERM in stochastic convex optimization: The dimension strikes back. In Advances in Neural Information Processing Systems 29 (NIPS), pages 3576–3584, 2016.
-  A. Gonen and S. Shalev-Shwartz. Average stability is invariant to data preconditioning. implications to exp-concave empirical risk minimization. CoRR, abs/1601.04011, 2016.
-  E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.
-  S. M. Kakade and A. Tewari. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems 21 (NIPS), pages 801–808, 2008.
-  T. Koren and K. Y. Levy. Fast rates for exp-concave empirical risk minimization. In Advances in Neural Information Processing Systems 28 (NIPS), pages 1477–1485, 2015.
-  H. J. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, second edition, 2003.
-  G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 2010.
-  M. Mahdavi, L. Zhang, and R. Jin. Lower and upper bounds on the generalization of stochastic exponentially concave optimization. In Proceedings of The 28th Conference on Learning Theory (COLT), pages 1305–1320, 2015.
-  N. A. Mehta. Fast rates with high probability in exp-concave statistical learning. ArXiv e-prints, arXiv:1605.01288, 2016.
-  N. A. Mehta and R. C. Williamson. From stochastic mixability to fast rates. In Advances in Neural Information Processing Systems 27 (NIPS), pages 1197–1205, 2014.
-  Y. Nesterov. Introductory lectures on convex optimization: a basic course, volume 87 of Applied optimization. Kluwer Academic Publishers, 2004.
-  G. Pisier. The volume of convex bodies and Banach space geometry. Cambridge Tracts in Mathematics (No. 94). Cambridge University Press, 1989.
-  Y. Plan and R. Vershynin. One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66(8):1275–1297, 2013.
-  S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
-  S. Smale and D.-X. Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26(2):153–172, 2007.
-  K. Sridharan, S. Shalev-shwartz, and N. Srebro. Fast rates for regularized objectives. In Advances in Neural Information Processing Systems 21, pages 1545–1552, 2009.
-  V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
-  L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
-  L. Zhang, T. Yang, and R. Jin. Empirical risk minimization for stochastic convex optimization: O(1/n)- and o(1/n)-type of risk bounds. CoRR, abs/1702.02030, 2017.
Appendix A Proof of Lemma 5
The proof is similar to the proof of Lemma 1 in DBLP:journals/corr/0005YJ17 . Denote by $\mathcal{N}_\epsilon$ an $\epsilon$-net of $\mathcal{W}$ with minimal cardinality. By the covering number theory, we have $|\mathcal{N}_\epsilon| = N(\mathcal{W},\epsilon)$. To prove the upper bound for all $\mathbf{w}\in\mathcal{W}$, we first consider a fixed point in the net, denoted by $\mathbf{w}'$. Since $f(\mathbf{w};\xi)$ is $\beta$-smooth for any $\xi$, we have