1 Introduction
The benefit of smoothness for obtaining faster convergence has been well established in the optimization literature. Sadly, many machine learning tasks are inherently nonsmooth, and thus do not inherit these favorable guarantees. In the nonsmooth setting, it is known that one can achieve better than the black-box O(1/ε²) rate for certain structured functions [25], including several (such as the hinge loss and ℓ_∞ regression) that play a pivotal role in modern machine learning. In this paper, we are interested in developing faster methods for these important nonsmooth optimization problems, one such example being the classic problem of ℓ_∞ regression. As noted in [17], even achieving a linear dependence in 1/ε has required careful handling of accelerated techniques for nonsmooth optimization [25, 33, 34]. In this work, we show how to go beyond these rates to achieve an iteration complexity that is sublinear in 1/ε. We further extend these results to the setting of soft-margin SVM, under various choices of regularization, again achieving iteration complexities that are sublinear in 1/ε. Additionally, by making use of efficient tensor methods [28, 10], we establish overall computational complexity in terms of (per-iteration) linear system solves, thus providing results that may be compared with [12, 11, 17]. The key observation of this work is that the softmax approximation to the max function, which we denote by smax_β (parameterized by β > 0), is not only smooth (i.e., its gradient is Lipschitz), but also higher-order smooth. In particular, we establish Lipschitz continuity of its third derivative by ensuring a bound on its fourth derivative, with Lipschitz constant O(β³). By combining this observation with recent advances in higher-order acceleration [18, 19, 9, 10], we achieve an improved iteration complexity of Õ(ε^{−4/5}), thus going beyond the previous O(1/ε) dependence [25, 33, 34, 17].
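To see where an ε^{−4/5} exponent arises, the following back-of-the-envelope derivation may be helpful (constants and logarithmic factors are suppressed; the near-optimal third-order rate is the one established in [18, 19, 9, 10]):

```latex
% Near-optimal rate under third-order smoothness (p = 3):
%   O\big( (L_3 R^4 / \epsilon)^{2/(3p+1)} \big) = O\big( (L_3 R^4 / \epsilon)^{1/5} \big).
% Smoothing: taking \beta = \Theta(\log(n)/\epsilon) makes the softmax
% approximation error at most \epsilon/2, and yields
%   L_3 = O(\beta^3) = \tilde{O}(\epsilon^{-3}).
% Substituting,
\left( \frac{L_3 R^4}{\epsilon} \right)^{1/5}
  = \tilde{O}\!\left( \left( \frac{\epsilon^{-3} R^4}{\epsilon} \right)^{1/5} \right)
  = \tilde{O}\!\left( R^{4/5}\, \epsilon^{-4/5} \right).
```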
After bringing together the higher-order smoothness of softmax with near-optimal higher-order acceleration techniques, we arrive at the following results, beginning with ℓ_∞ regression.
Theorem 1.1.
Let f(x) := ‖Ax − b‖_∞ for A ∈ ℝ^{n×d}, b ∈ ℝ^n, s.t. ‖x*‖ ≤ R for some minimizer x*, and let ε > 0. There is a method, initialized with x₀ = 0, that outputs x̄ such that
f(x̄) − min_x f(x) ≤ ε
in Õ(ε^{−4/5}) iterations, where each iteration requires Õ(1) calls to a gradient oracle and Õ(1) solutions to linear systems of the form A^⊤DAy = v, for diagonal matrix D, v ∈ ℝ^d, and for some problem-dependent parameter hidden by the Õ(·) notation.
Our results are also applicable to soft-margin SVMs, and so in particular, we get the following for ℓ₁-SVM [8, 37, 23].
Theorem 1.2.
Let F(w) := (1/m) Σ_{i=1}^m max{0, 1 − y_i⟨x_i, w⟩} + λ‖w‖₁ where y_i ∈ {−1, 1}, for λ > 0, and let ε > 0. There is a method, initialized with w₀ = 0, that outputs w̄ such that
F(w̄) − min_w F(w) ≤ ε
in Õ(ε^{−4/5}) iterations, where each iteration requires Õ(1) calls to a gradient oracle and a linear system solver, for some problem-dependent parameter hidden by the Õ(·) notation.
We emphasize that such rates were not known before, to the best of our knowledge. Furthermore, our stronger oracle model seems necessary for going beyond an O(1/ε) dependence, due to tight upper and lower bounds known for first-order methods with prox-oracle access, when the convex function is neither smooth nor strongly convex [36]. In addition, it is well-known that some structured linear systems can be solved in nearly-linear time [35, 21], making the per-iteration complexity competitive with first-order methods in such settings.
We also remark that determining the precise iteration complexities attainable under various higher-order oracle models and smoothness assumptions has been a highly active area of research [29, 26, 4, 24, 1, 18, 19, 9, 10], and so our results complement these by extending their reach to nonsmooth problems under higher-order oracle access.
1.1 Related work
Smooth approximation techniques:
It was shown by Nesterov [25] that one can go beyond the black-box convergence rate of O(1/ε²) to achieve an O(1/ε) rate for certain classes of nonsmooth functions. The main idea was to carefully smooth the well-structured nonsmooth function and then apply an accelerated method to the smoothed problem, and the work goes on to present several applications of the method, including ℓ₁ and ℓ_∞ regression, in addition to saddle-point games. However, the methods for all of these examples incur an O(1/ε) dependence, which remains in several works that build upon these techniques [33, 34]. For a more comprehensive overview, we refer the reader to [6].
Higherorder accelerated methods:
Several works have considered accelerated variants of optimization methods based on access to higher-order derivative information. Nesterov [26] showed that one can accelerate cubic regularization, under a Lipschitz Hessian condition, to attain faster convergence, and these results were later generalized by Baes [4] to arbitrary higher-order oracle access under the appropriate notions of (higher-order) smoothness. The rate attained in [26] was further improved upon by Monteiro and Svaiter [24], and lower bounds have established that the oracle complexity of this result is nearly tight (up to logarithmic factors) when the Hessian is Lipschitz [3]. Until recently, however, it was an open question whether these lower bounds are tight for general higher-order oracle access (and smoothness), though this question has been mostly resolved by several works developed over the past year [18, 19, 9, 10].
ℓ_∞ regression:
Various regression problems play a central role in numerous computational and machine learning tasks. Designing better methods for ℓ_∞ regression in particular has led to faster approximate max flow algorithms [12, 11, 20, 33, 34]. Recently, Ene and Vladu [17] presented a method for ℓ_∞ (and ℓ₁) regression, based on iteratively reweighted least squares, that achieves an improved iteration complexity. We note that their rate of convergence has an explicit dimension dependence, whereas our result, in contrast, includes a diameter term R.
Soft-margin SVM:
Support vector machines (SVMs) [14] have enjoyed widespread adoption for classification tasks in machine learning [15]. For the soft-margin version, several approaches have been proposed for dealing with the nonsmooth nature of the hinge loss. The standard approach is to cast the (regularized) SVM problem as a quadratic program [31, 7]. Stochastic subgradient methods have also been successful due to their advantage in per-iteration cost [32]. While the ℓ₂-SVM is arguably the most well-known variant, ℓ_p-SVMs, for general p ≥ 1, have also been studied [8]. ℓ₁-SVMs [37, 23] are appealing, in particular, due to their sparsity-inducing tendencies, though they forfeit the strong convexity guarantees that come with ℓ₂ regularization [2].
Interior-point methods:
It is well-known that both ℓ_∞ regression and ℓ₁-SVM can be expressed as linear programs [7, 8], and thus are amenable to fast LP solvers [22, 13]. In particular, this means that each can be solved either in roughly O(n^ω) time (where ω is the matrix multiplication constant) [13], or in Õ(√rank) linear system solves [22]. We note that, while these methods dominate in the low-error regime, our method is competitive, under modest choices of ε and favorable linear system solves, when the diameter term is small (up to logarithmic factors).
1.2 Our contributions
The main contributions of this work are as follows:

We provide improved higher-order oracle complexities for several important nonsmooth optimization problems, by combining near-optimal higher-order acceleration with the appropriate smooth approximations.
We further stress that the convergence guarantees presented in this work surpass the tight upper and lower bounds known under first-order and prox-oracle access, for nonsmooth and non-strongly convex functions [36]. Thus, we observe that higher-order oracle access provides an advantage not only for functions that are sufficiently smooth, but also in the nonsmooth setting.
In addition, we wish to note the importance of relying on more recent advances in near-optimal higher-order acceleration [18, 19, 9, 10]. We may recall in particular that the higher-order acceleration scheme in [4] achieves a rate of O((L_p R^{p+1}/ε)^{1/(p+1)}) (assuming the p-th derivative is L_p-Lipschitz). Thus, for the case of p = 3 (whereby L₃ = O(β³)), this approach would not improve upon the previous O(1/ε) dependence since, roughly speaking, we would only expect to recover a rate of O(1/ε).
While one may also consider repeatedly applying Gaussian smoothing to induce higher-order smoothness, this approach suffers from two primary drawbacks: (1) a straightforward application would incur an additional dimension-dependent factor, and (2) it would become necessary to compute higher-order derivatives of the Gaussian-smoothed function.
2 Setup
Let x, y denote vectors in ℝ^d. Throughout, we let x_i denote the i-th coordinate of x, and we let [n] := {1, …, n} for n ∈ ℕ. We let x ⊙ y denote the Hadamard product, i.e., (x ⊙ y)_i = x_i y_i for all i. Furthermore, we will let 1 denote the all-ones vector and 0 the all-zeros vector. We let Δ_n := {x ∈ ℝ^n : x ≥ 0, ⟨1, x⟩ = 1} denote the n-dimensional simplex. We let ‖x‖_p denote the standard ℓ_p norm, and we drop the subscript to let ‖x‖ denote the ℓ₂ norm. Let M be a symmetric positive-definite matrix, i.e., M ≻ 0. Then, we may define the matrix-induced norm of x (w.r.t. M) as ‖x‖_M := ⟨x, Mx⟩^{1/2}, and we let ‖M‖ denote its operator norm.
We now make formal a (higher-order) notion of smoothness. Specifically, for p ≥ 1, we say a p-times differentiable function f is L_p-smooth (of order p) w.r.t. ‖·‖ if the p-th derivative is Lipschitz continuous, i.e., for all x, y,
(1) ‖∇^p f(x) − ∇^p f(y)‖_op ≤ L_p ‖x − y‖,
where we define, for a symmetric p-linear form T,
‖T‖_op := max_{‖h₁‖, …, ‖h_p‖ ≤ 1} |T[h₁, …, h_p]|,
and where ∇^p f(x) denotes the p-th derivative tensor of f at x.
Observe that, for p = 1, this recovers the usual notion of smoothness, and so our convention will be to refer to first-order smooth functions as simply smooth. A complementary notion is that of strong convexity, and its higher-order generalization known as uniform convexity [26]. In particular, f is σ-uniformly convex (of order q) with respect to ‖·‖ if, for all x, y,
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (σ/q)‖y − x‖^q.
Again, we may see that this captures the typical strong convexity (w.r.t. ‖·‖) by setting q = 2.
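As a concrete sanity check (our own illustration, not from the paper), the following sketch numerically verifies the uniform convexity inequality for the one-dimensional function f(x) = x⁴/4, which is uniformly convex of order q = 4 (here with the conservative parameter σ = 1/4) but not strongly convex, since f''(0) = 0:

```python
# Numeric check of order-4 uniform convexity for f(x) = x^4 / 4:
#   f(y) >= f(x) + f'(x) (y - x) + (sigma / 4) * (y - x)^4
# sigma = 1/4 is a conservative (valid) constant for this f.

def f(x):
    return x ** 4 / 4.0

def fprime(x):
    return x ** 3

sigma = 0.25
grid = [i / 10.0 for i in range(-30, 31)]  # points in [-3, 3]
for x in grid:
    for y in grid:
        lower = f(x) + fprime(x) * (y - x) + (sigma / 4.0) * (y - x) ** 4
        assert f(y) >= lower - 1e-12, (x, y)
```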
3 Softmax approximation and ℓ_∞ regression
We recall from [25, 34] the standard softmax approximation, for β > 0:
(2) smax_β(x) := (1/β) log( Σ_{i=1}^n e^{βx_i} ).
It is straightforward to observe that (2) is smooth, and furthermore that it smoothly approximates the max function, i.e., smax_β(x) → max_{i∈[n]} x_i as β → ∞.
Fact 3.1.
For all x ∈ ℝ^n,
(3) max_{i∈[n]} x_i ≤ smax_β(x) ≤ max_{i∈[n]} x_i + (log n)/β.
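Fact 3.1 is easy to check numerically. A minimal sketch (the function name `smax` is ours):

```python
import math

def smax(x, beta):
    """Softmax approximation to max: (1/beta) * log(sum_i exp(beta * x_i)).

    Computed stably by factoring out the maximum entry."""
    m = max(x)
    return m + math.log(sum(math.exp(beta * (xi - m)) for xi in x)) / beta

x = [0.3, -1.2, 0.9, 0.1]
n = len(x)
for beta in (1.0, 10.0, 100.0):
    s = smax(x, beta)
    # sandwich bound: max(x) <= smax_beta(x) <= max(x) + log(n)/beta
    assert max(x) <= s <= max(x) + math.log(n) / beta
```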
Note that this approximation can be used for ‖·‖_∞, since ‖y‖_∞ = max{max_i y_i, max_i (−y_i)} = max_i ([y; −y])_i. It follows that we may determine a smooth approximation of ℓ_∞ regression, i.e.,
(4) min_{x∈ℝ^d} f(x), for f(x) := ‖Ax − b‖_∞,
as f_β(x) := smax_β(Âx − b̂), where Â := [A; −A] and b̂ := [b; −b].
Having now formalized the connection between f and f_β, we assume throughout the rest of the paper that A and b are already of the stacked form Â, b̂, as the difference in dimension between A, Â and b, b̂ only affects the final convergence by a constant factor. In addition, we will assume that β is chosen large enough that the smoothing error log(n)/β is at most ε/2, and thus we consider the regime where β = Ω(log(n)/ε).
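Concretely, the smoothing of the ℓ_∞ regression objective can be sketched as follows (a toy illustration in our own notation; stacking ±(Ax − b) handles the absolute values):

```python
import math

def smax(z, beta):
    m = max(z)
    return m + math.log(sum(math.exp(beta * (zi - m)) for zi in z)) / beta

def residual(A, b, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) - bi
            for row, bi in zip(A, b)]

def f_inf(A, b, x):
    """f(x) = ||Ax - b||_inf."""
    return max(abs(ri) for ri in residual(A, b, x))

def f_smooth(A, b, x, beta):
    """Smoothed objective: smax_beta applied to the stacked residuals [r; -r]."""
    r = residual(A, b, x)
    return smax(r + [-ri for ri in r], beta)

A = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
b = [1.0, 0.0, -2.0]
x = [0.2, -0.4]
beta = 50.0
n = 2 * len(b)  # stacked dimension
# sandwich bound for the smoothed objective
assert f_inf(A, b, x) <= f_smooth(A, b, x, beta) <= f_inf(A, b, x) + math.log(n) / beta
```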
3.1 Softmax calculus
To simplify notation, we let z(x) := e^{βx} (entrywise), and so smax_β(x) = (1/β) log⟨1, z(x)⟩. Note that we have
(5) ∇smax_β(x) = z(x)/⟨1, z(x)⟩ =: p(x).
Furthermore, since p(x) ∈ Δ_n for all x, it follows that, for all i ∈ [n],
(6) 0 ≤ p_i(x) ≤ 1, with ⟨1, p(x)⟩ = 1.
We may also see that
(7) ∇²smax_β(x) = β(diag(p(x)) − p(x)p(x)^⊤).
Since ∇²smax_β(x) is a symmetric bilinear form for all x, it follows that, for all h,
(8) ∇²smax_β(x)[h, h] = β(⟨p(x), h ⊙ h⟩ − ⟨p(x), h⟩²) ≥ 0.
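The calculus above (the gradient of smax_β is the softmax distribution p(x), and the Hessian is β(diag(p) − pp^⊤)) can be checked by finite differences. A sketch, under the formulas as we have reconstructed them:

```python
import math

def smax(x, beta):
    m = max(x)
    return m + math.log(sum(math.exp(beta * (xi - m)) for xi in x)) / beta

def probs(x, beta):
    """Softmax distribution p(x): the gradient of smax_beta."""
    m = max(x)
    w = [math.exp(beta * (xi - m)) for xi in x]
    s = sum(w)
    return [wi / s for wi in w]

x = [0.5, -0.2, 0.1]
beta = 3.0
h = 1e-6
p = probs(x, beta)

# gradient check: d smax / d x_i  ==  p_i
for i in range(len(x)):
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    fd = (smax(xp, beta) - smax(xm, beta)) / (2 * h)
    assert abs(fd - p[i]) < 1e-5

# Hessian check: d^2 smax / dx_i dx_j  ==  beta * (p_i * [i == j] - p_i * p_j)
for i in range(len(x)):
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fd = (probs(xp, beta)[i] - probs(xm, beta)[i]) / (2 * h)
        exact = beta * ((p[i] if i == j else 0.0) - p[i] * p[j])
        assert abs(fd - exact) < 1e-4
```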
3.2 Higherorder smoothness
As mentioned previously, one of the key observations of this work is that softmax is equipped with favorable higher-order smoothness properties. We begin by showing a bound on its fourth derivative, as established by the following lemma, whose proof we provide in the appendix.
Lemma 3.2.
For all x ∈ ℝ^n and h ∈ ℝ^n,
(9) |∇⁴smax_β(x)[h, h, h, h]| ≤ O(β³)‖h‖⁴.
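To get a feel for the β³ scaling in Lemma 3.2, consider the one-dimensional restriction g(t) = smax_β((t, 0)) = (1/β)·log(1 + e^{βt}). Its fourth derivative is β³·σ'''(βt), where σ is the logistic sigmoid, so its supremum grows exactly as β³/8. A numeric sketch (our own illustration, not the paper's proof):

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def g4(t, beta):
    """Fourth derivative of g(t) = (1/beta) * log(1 + exp(beta * t)).

    Since g'(t) = sigmoid(beta * t), we have g''''(t) = beta^3 * sigmoid'''(beta * t),
    where sigmoid'''(u) = s (1 - s) (1 - 6 s + 6 s^2) for s = sigmoid(u)."""
    s = sigmoid(beta * t)
    return beta ** 3 * s * (1 - s) * (1 - 6 * s + 6 * s ** 2)

for beta in (1.0, 2.0, 10.0):
    sup = max(abs(g4(t / 100.0, beta)) for t in range(-500, 501))
    # the supremum of |g''''| is beta^3 / 8, attained at t = 0
    assert abs(sup - beta ** 3 / 8.0) < 1e-6 * beta ** 3
```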
It will also be helpful to note the following standard result on how a bound on the fourth derivative implies Lipschitz continuity of the third derivative.
Lemma 3.3.
Let g be a 4-times differentiable function, let x and y be points such that the segment [x, y] lies in its domain, and suppose, for all u ∈ [x, y] and all h,
(10) |∇⁴g(u)[h, h, h, h]| ≤ L‖h‖⁴.
Then we have that, for all such x, y,
(11) ‖∇³g(x) − ∇³g(y)‖_op ≤ L‖x − y‖.
Having determined these bounds, we now provide smoothness guarantees for the softmax approximation to regression.
Theorem 3.4.
Let β > 0. Then, f_β is L₃-smooth (order 3) w.r.t. ‖·‖, for L₃ = O(β³‖Â‖⁴).
4 Higherorder acceleration
We now rely on recent techniques for near-optimal higher-order acceleration [18, 19, 9, 10]. For these higher-order iterative methods, assuming f is (order p) smooth, the basic idea for each iteration is to determine a minimizer of the subproblem given by the order-p Taylor expansion Φ_x(y), centered around the current iterate x, plus a (p+1)-st order regularization term, i.e.,
(12) y⁺ := argmin_y { Φ_x(y) + (M/(p+1)!)‖y − x‖^{p+1} },
where, for all y,
(13) Φ_x(y) := f(x) + Σ_{k=1}^{p} (1/k!) ∇^k f(x)[y − x, …, y − x].
Given access to such an oracle, it is possible to combine it with a carefully-tuned acceleration scheme to achieve an improved iteration complexity when the p-th derivative is Lipschitz. In contrast to the higher-order accelerated scheme of Nesterov [26] (later generalized by Baes [4]), these near-optimal rates rely on a certain additional binary search procedure, as first observed by Monteiro and Svaiter [24].
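The regularized Taylor model can be illustrated in one dimension. Below is a sketch using f(t) = log(1 + e^t) (a smooth stand-in, not one of the paper's objectives) with p = 3 and the regularization convention (M/24)·|y − x|⁴ (conventions for the regularization coefficient vary across [26, 28, 10]); for M large enough relative to the third-derivative Lipschitz constant, minimizing the model also decreases f:

```python
import math

def f(t):      # smooth test function (softplus); a stand-in objective
    return math.log(1.0 + math.exp(t))

def d1(t):     # first derivative: the logistic sigmoid
    return 1.0 / (1.0 + math.exp(-t))

def d2(t):
    s = d1(t)
    return s * (1 - s)

def d3(t):
    s = d1(t)
    return s * (1 - s) * (1 - 2 * s)

def model(y, x, M):
    """Third-order Taylor expansion of f at x, plus quartic regularization."""
    u = y - x
    taylor = f(x) + d1(x) * u + d2(x) * u ** 2 / 2.0 + d3(x) * u ** 3 / 6.0
    return taylor + M * u ** 4 / 24.0

def minimize_model(x, M, lo=-10.0, hi=10.0, iters=200):
    # the regularized model is convex in 1D for this f, so ternary search works
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if model(m1, x, M) < model(m2, x, M):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0

x, M = 2.0, 1.0   # M = 1 exceeds the third-derivative Lipschitz constant (1/8 here)
y = minimize_model(x, M)
assert model(y, x, M) <= model(x, x, M)  # the model decreases at its minimizer
assert f(y) < f(x)                       # and so does f, for M large enough
```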
In particular, we are motivated by the method of [10], and we provide a sketch of the algorithm here. Note that, for the sake of clarity, various approximations found in the precise algorithm have been omitted, and we refer the reader to [10] for the complete presentation.
As established by Bullins [10], this method provides us with the following guarantee.
Theorem 4.1 ([10], Theorem 4.1).
Suppose f is L₃-smooth (order 3) w.r.t. ‖·‖ for L₃ > 0. Then, the method finds a point x̄ such that
f(x̄) − min_x f(x) ≤ ε
in Õ((L₃R⁴/ε)^{1/5}) iterations, where each iteration requires Õ(1) calls to a gradient oracle and a linear system solver, and where the suppressed factor is a polynomial in various problem-dependent parameters.
Given this result, we have the following corollary which will be useful for our smoothed minimization problem.
Corollary 4.2.
Let f_β be the softmax approximation to (4) with β = Θ(log(n)/ε), and suppose ‖x*‖ ≤ R for some minimizer x* of f_β. Then, the method finds a point x̄ such that
f_β(x̄) − min_x f_β(x) ≤ ε/2
in Õ((R/ε)^{4/5}) iterations, where each iteration requires Õ(1) calls to a gradient oracle and Õ(1) solutions to linear systems of the form A^⊤DAy = v, for diagonal matrix D and v ∈ ℝ^d.
We are now equipped with the tools necessary for proving Theorem 1.1.
5 Softmargin SVM
In this section we shift our focus to consider various instances of soft-margin SVM. It is known that in the ℓ₂-regularized case, an improved Õ(1/√ε) rate is possible [30, 27, 2], and so we give the first sub-O(1/ε) rate for variants of SVM that are both nonsmooth and non-strongly convex. In Section 5.1, we handle ℓ₁ regularization, and in Section 5.2, we consider the case of higher-order regularizers.
5.1 ℓ₁-regularized SVM
We begin with ℓ₁-regularized soft-margin SVM (ℓ₁-SVM), i.e.,
(14) min_w F(w) := (1/m) Σ_{i=1}^m max{0, 1 − y_i⟨x_i, w⟩} + λ‖w‖₁,
for training examples x_i ∈ ℝ^d with labels y_i ∈ {−1, 1} (i ∈ [m]), and λ > 0. To simplify the notation, we define z_i := y_i x_i. Letting h(w) := (1/m) Σ_{i=1}^m max{0, 1 − ⟨z_i, w⟩} and
(15) r(w) := λ‖w‖₁,
we may then rewrite F(w) = h(w) + r(w). We now make the following observations concerning softmax-based approximations for h and r.
Lemma 5.1 ( approximation).
Let r(w) := λ‖w‖₁, and let r_β(w) := λ Σ_{i=1}^d smax_β((w_i, −w_i)) for β > 0. Then, we have that
(16) r(w) ≤ r_β(w) ≤ r(w) + λd log(2)/β.
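A coordinate-wise sketch of this ℓ₁ smoothing (notation ours): each |w_i| is replaced by smax_β((w_i, −w_i)) = (1/β)·log(e^{βw_i} + e^{−βw_i}), incurring at most log(2)/β error per coordinate:

```python
import math

def smooth_abs(t, beta):
    """smax_beta((t, -t)) = (1/beta) * log(exp(beta*t) + exp(-beta*t)),
    computed stably as |t| + log(1 + exp(-2*beta*|t|)) / beta."""
    a = abs(t)
    return a + math.log1p(math.exp(-2.0 * beta * a)) / beta

def l1(w):
    return sum(abs(wi) for wi in w)

def l1_smooth(w, beta):
    return sum(smooth_abs(wi, beta) for wi in w)

w = [0.7, -0.2, 0.0, 1.5]
beta = 25.0
d = len(w)
# sandwich: ||w||_1 <= smoothed <= ||w||_1 + d * log(2) / beta
assert l1(w) <= l1_smooth(w, beta) <= l1(w) + d * math.log(2.0) / beta
```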
Lemma 5.2 (Smooth hinge loss approximation).
Let h_β(w) := (1/m) Σ_{i=1}^m smax_β((0, 1 − ⟨z_i, w⟩)) for β > 0. Then
(17) h(w) ≤ h_β(w) ≤ h(w) + log(2)/β.
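Similarly, the hinge loss max{0, 1 − z} can be smoothed as smax_β((0, 1 − z)) = (1/β)·log(1 + e^{β(1−z)}), with error at most log(2)/β. A sketch in our notation:

```python
import math

def hinge(z):
    return max(0.0, 1.0 - z)

def smooth_hinge(z, beta):
    """smax_beta((0, 1 - z)) = (1/beta) * log(1 + exp(beta * (1 - z))),
    computed stably."""
    u = 1.0 - z
    return max(u, 0.0) + math.log1p(math.exp(-beta * abs(u))) / beta

beta = 40.0
for z in (-2.0, 0.0, 0.99, 1.0, 1.01, 3.0):
    s = smooth_hinge(z, beta)
    # sandwich: hinge(z) <= smoothed <= hinge(z) + log(2)/beta
    assert hinge(z) <= s <= hinge(z) + math.log(2.0) / beta
```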
This gives us a natural smooth approximation to F, namely,
(18) F_β(w) := h_β(w) + r_β(w).
Taken together with these approximations, we arrive at the following lemma, the proof of which follows by combining Lemmas 5.1 and 5.2.
Lemma 5.3.
Let β > 0, and let F be as in (14). Then, for all w,
(19) F(w) ≤ F_β(w) ≤ F(w) + (λd + 1) log(2)/β.
As was the case for ℓ_∞ regression, in order to make use of the guarantees provided by the accelerated method of [10], we must first show higher-order smoothness, and so we have the following theorem.
Theorem 5.4.
Let β > 0. Then, F_β is L₃-smooth (order 3) w.r.t. ‖·‖, for L₃ = O(β³).
Corollary 5.5.
Let F_β be the smooth approximation to F (as in (14)), with β chosen so that the approximation error in (19) is at most ε/2. Then, the method finds a point w̄ such that
F(w̄) − min_w F(w) ≤ ε
in Õ(ε^{−4/5}) iterations, where each iteration requires Õ(1) calls to a gradient oracle and a linear system solver, and where the suppressed factor is a polynomial in various problem-dependent parameters.
5.2 Higherorder regularization
The soft-margin SVM model has been studied with various choices of regularization beyond ℓ₁ and ℓ₂ [8]. Just as introducing strong convexity can lead to faster convergence for ℓ₂-SVM [2], we may see that a similar advantage may be obtained for an appropriately chosen regularizer that is uniformly convex. More concretely, if we consider the soft-margin SVM with the regularizer λ‖w‖₄⁴ (which is uniformly convex of order 4), we are able to use the following theorem from [10], which holds for functions that are both higher-order smooth and uniformly convex.
Theorem 5.6 ([10], Theorem 4.2).
Suppose f is L₃-smooth (order 3) and σ-uniformly convex (order 4) w.r.t. ‖·‖, let ε > 0, and let x* ∈ argmin_x f(x). Then, with the appropriate restarting procedure, the method finds a point x̄ such that
f(x̄) − f(x*) ≤ ε
in Õ((L₃/σ)^{1/5} log(1/ε)) iterations, where each iteration requires Õ(1) calls to a gradient oracle and a linear system solver, and where the suppressed factor is a polynomial in various problem-dependent parameters.
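The restarting idea behind Theorem 5.6 can be sketched generically: uniform convexity converts a function-value gap into a distance bound, so after each phase the distance to the optimum shrinks and the base method is restarted with the tighter bound. Below is a toy illustration, with gradient descent as a stand-in base method (not the paper's solver) on the order-4 uniformly convex function f(x) = x⁴/4:

```python
def restart_scheme(x0, R0, phases, steps_per_phase=20):
    """Each phase runs a base method (here: gradient descent on f(x) = x^4 / 4,
    with gradient f'(x) = x^3) long enough to halve the distance bound R,
    then restarts with R := R / 2."""
    x, R = x0, R0
    for _ in range(phases):
        eta = 1.0 / (3.0 * R * R)  # safe step size: f''(x) = 3x^2 <= 3R^2 on |x| <= R
        for _ in range(steps_per_phase):
            x = x - eta * x ** 3   # gradient step
        R = R / 2.0
        assert abs(x) <= R         # the distance bound halves every phase
    return x

x = restart_scheme(x0=1.0, R0=1.0, phases=8)
assert abs(x) <= 1.0 / 2 ** 8
```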
Remark 5.7.
While the choice of fourth-order regularization may appear arbitrary, we note that the fourth-order regularized Taylor models (eq. (13), for p = 3) permit efficiently computable (approximate) solutions [28, 10], and developing efficient tensor methods for subproblems beyond the fourth-order model remains an interesting open problem.
Thus, we may consider ℓ₄-SVM, i.e., soft-margin SVM with regularizer λ‖w‖₄⁴ (as presented in [8]), along with its smooth counterpart, which brings us to the following corollary and theorem for ℓ₄-SVM.
Corollary 5.8.
Let F_β be the smooth approximation to the ℓ₄-SVM objective, with β chosen so that the approximation error is at most ε/2. Then, with the appropriate restarting procedure, the method finds a point w̄ such that
F_β(w̄) − min_w F_β(w) ≤ ε/2
in Õ(ε^{−3/5}) iterations, where each iteration requires Õ(1) calls to a gradient oracle and a linear system solver, and where the suppressed factor is a polynomial in various problem-dependent parameters.
Theorem 5.9.
Let F(w) := (1/m) Σ_{i=1}^m max{0, 1 − y_i⟨x_i, w⟩} + λ‖w‖₄⁴ where y_i ∈ {−1, 1}, for λ > 0, and let ε > 0. There is a method, initialized with w₀ = 0, that outputs w̄ such that
F(w̄) − min_w F(w) ≤ ε
in Õ(ε^{−3/5}) iterations, where each iteration requires Õ(1) calls to a gradient oracle and a linear system solver, for some problem-dependent parameter hidden by the Õ(·) notation.
While we acknowledge that this result is limited to the (less common) case of ℓ₄-SVM (see Remark 5.7), we include it here to illustrate the iteration complexity improvement, from Õ(ε^{−4/5}) to Õ(ε^{−3/5}), under additional uniform convexity guarantees, similar to the improvement gained for strongly convex nonsmooth problems [5, 2].
6 Conclusion
In this work, we have shown how to harness the power of higher-order acceleration for faster nonsmooth optimization. While we have focused primarily on convex optimization, one potential direction would be to investigate whether these techniques can extend to the nonsmooth nonconvex setting. Although it is not possible in general to guarantee convergence to a first-order critical point (i.e., a point where the (sub)gradient vanishes) for nonsmooth problems, recent work has considered a relaxed version of the nonconvex problem with a Moreau envelope-based smoothing [16]. Improving max flow would be another interesting future direction, perhaps by connecting these higher-order techniques with the results in [33, 34].
Acknowledgements
We thank Sébastien Bubeck, Yin Tat Lee, Sushant Sachdeva, Cyril Zhang, and Yi Zhang for numerous helpful conversations. BB is supported by Elad Hazan’s NSF grant CCF-1704860. RP is partially supported by the National Science Foundation under Grant #1718533.
References

[1] N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, and T. Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1195–1199. ACM, 2017.
 [2] Z. Allen-Zhu and E. Hazan. Optimal black-box reductions between optimization objectives. In Advances in Neural Information Processing Systems, pages 1614–1622, 2016.
 [3] Y. Arjevani, O. Shamir, and R. Shiff. Oracle complexity of second-order methods for smooth convex optimization. Mathematical Programming, pages 1–34, 2018.
 [4] M. Baes. Estimate sequence methods: extensions and approximations. 2009.
 [5] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
 [6] A. Beck and M. Teboulle. Smoothing and first order methods: A unified framework. SIAM Journal on Optimization, 22(2):557–580, 2012.
 [7] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
 [8] P. S. Bradley and O. Mangasarian. Feature selection via concave minimization and support vector machines. In International Conference on Machine Learning, pages 82–90, 1998.
 [9] S. Bubeck, Q. Jiang, Y. T. Lee, Y. Li, and A. Sidford. Near-optimal method for highly smooth convex optimization. arXiv preprint arXiv:1812.08026, 2018.
 [10] B. Bullins. Fast minimization of structured convex quartics. arXiv preprint arXiv:1812.10349, 2018.
 [11] H. H. Chin, A. Madry, G. L. Miller, and R. Peng. Runtime guarantees for regression problems. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pages 269–282. ACM, 2013.
 [12] P. Christiano, J. A. Kelner, A. Madry, D. A. Spielman, and S.H. Teng. Electrical flows, Laplacian systems, and faster approximation of maximum flow in undirected graphs. In Proceedings of the FortyThird Annual ACM Symposium on Theory of Computing, pages 273–282. ACM, 2011.
 [13] M. B. Cohen, Y. T. Lee, and Z. Song. Solving linear programs in the current matrix multiplication time. arXiv preprint arXiv:1810.07896, 2018.
 [14] C. Cortes and V. Vapnik. Supportvector networks. Machine Learning, 20(3):273–297, 1995.
 [15] N. Cristianini, J. ShaweTaylor, et al. An introduction to support vector machines and other kernelbased learning methods. Cambridge university press, 2000.
 [16] D. Davis and D. Drusvyatskiy. Complexity of finding nearstationary points of convex functions stochastically. arXiv preprint arXiv:1802.08556, 2018.
 [17] A. Ene and A. Vladu. Improved convergence for ℓ_∞ and ℓ_1 regression via iteratively reweighted least squares. In International Conference on Machine Learning, 2019.
 [18] A. Gasnikov, P. Dvurechensky, E. Gorbunov, D. Kovalev, A. Mohhamed, E. Chernousova, and C. A. Uribe. The global rate of convergence for optimal tensor methods in smooth convex optimization. arXiv preprint arXiv:1809.00382 (v10), 2018.
 [19] B. Jiang, H. Wang, and S. Zhang. An optimal highorder tensor method for convex optimization. arXiv preprint arXiv:1812.06557, 2018.
 [20] J. A. Kelner, Y. T. Lee, L. Orecchia, and A. Sidford. An almost-linear-time algorithm for approximate max flow in undirected graphs, and its multicommodity generalizations. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 217–226. SIAM, 2014.
 [21] I. Koutis, G. L. Miller, and R. Peng. A fast solver for a class of linear systems. Communications of the ACM, 55(10):99–107, 2012.
 [22] Y. T. Lee and A. Sidford. Path finding methods for linear programming: Solving linear programs in Õ(√rank) iterations and faster algorithms for maximum flow. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 424–433, 2014.
 [23] O. L. Mangasarian. Exact 1norm support vector machines via unconstrained convex differentiable minimization. Journal of Machine Learning Research, 7(Jul):1517–1530, 2006.
 [24] R. D. Monteiro and B. F. Svaiter. An accelerated hybrid proximal extragradient method for convex optimization and its implications to secondorder methods. SIAM Journal on Optimization, 23(2):1092–1125, 2013.
 [25] Y. Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, 103(1):127–152, 2005.
 [26] Y. Nesterov. Accelerating the cubic regularization of Newton’s method on convex problems. Mathematical Programming, 112(1):159–181, 2008.
 [27] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
 [28] Y. Nesterov. Implementable tensor methods in unconstrained convex optimization. Technical report, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2018.
 [29] Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
 [30] F. Orabona, A. Argyriou, and N. Srebro. Prisma: Proximal iterative smoothing algorithm. arXiv preprint arXiv:1206.2372, 2012.
 [31] J. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, April 1998.
 [32] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
 [33] J. Sherman. Areaconvexity, regularization, and undirected multicommodity flow. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 452–460. ACM, 2017.
 [34] A. Sidford and K. Tian. Coordinate methods for accelerating ℓ_∞ regression and faster approximate maximum flow. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science, pages 922–933, 2018.
 [35] D. A. Spielman and S.-H. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 81–90. ACM, 2004.
 [36] B. E. Woodworth and N. Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems, pages 3639–3647, 2016.
 [37] J. Zhu, S. Rosset, R. Tibshirani, and T. J. Hastie. 1norm support vector machines. In Advances in Neural Information Processing Systems, pages 49–56, 2004.
Appendix A Proofs
A.1 Proof of Lemma 3.2
Proof of Lemma 3.2.
It follows from (8) that, for all ,
(20) 
and
(21) 
where the second and fourth inequalities follow from Hölder’s inequality and (6), respectively. We may similarly see that
(22) 
By taking the derivative of with respect to , for , we have that
(23) 
and so for any ,
(24) 
This implies that
(25) 
and so we may bound by
(26) 
Furthermore, since by (23) we have that
it follows that