1 Introduction
Optimization is at the core of machine learning and many other fields of applied research, for instance operations research, optimal control, and deep learning. The latter fields have embraced frameworks that combine a modeling language with only a few optimization solvers: interior point solvers in operations research, and stochastic gradient descent (SGD) and variants thereof in deep learning frameworks like TensorFlow, PyTorch, or Caffe. That is in stark contrast to classical (i.e., non-deep) machine learning, where new problems are often accompanied by new optimization algorithms and their implementation. However, designing and implementing optimization algorithms is still a time-consuming and error-prone task.
The lack of an optimization framework for classical machine learning problems can be explained partially by the common belief that any efficient solver needs to exploit problem-specific structure. Here, we challenge this common belief.
We introduce GENO (GENeric Optimization), an optimization framework that allows the user to state optimization problems in an easy-to-read modeling language. From the specification, an optimizer is automatically generated by using automatic differentiation on a symbolic level. The optimizer combines a quasi-Newton solver with an augmented Lagrangian approach for handling constraints.
Any generic modeling-language-plus-solver approach frees the user from tedious implementation aspects and allows them to focus on the modeling aspects of the problem at hand. However, the solver must be efficient and accurate. Contrary to common belief, we show here that the solvers generated by GENO are (1) as efficient as well-engineered, specialized solvers at the same or better accuracy, (2) more efficient by a decent margin than recent state-of-the-art solvers, and (3) orders of magnitude more efficient than classical modeling-language-plus-solver approaches.
Related work.
Classical machine learning is typically served by toolboxes like scikit-learn scikitlearn, Weka weka2, and MLlib mllib. These toolboxes mainly serve as wrappers for a collection of well-engineered implementations of standard solvers like LIBSVM ChangL01 for support vector machines or glmnet FriedmanHT09 for generalized linear models. A disadvantage of the toolbox approach is a lack of flexibility. An only slightly changed model, for instance one with an added non-negativity constraint, might already be missing from the framework.

Modeling languages provide more flexibility since they allow the specification of problems from large problem classes. Popular modeling languages for optimization are CVX cvx; cvx2 for MATLAB and its Python extension CVXPY cvxpy2; cvxpy, and JuMP jump, which is bound to Julia. In the operations research community, AMPL fourerGK03 and GAMS Gams
have been used for many years. All these languages take an instance of an optimization problem and transform it into some standard form of a linear program (LP), quadratic program (QP), second-order cone program (SOCP), or semidefinite program (SDP). The transformed problem is then addressed by solvers for the corresponding standard form. However, the transformation into standard form can be inefficient, because the formal representation in standard form can grow substantially with the problem size. This representational inefficiency directly translates into computational inefficiency.
The modeling-language-plus-solver paradigm has been made deployable in the CVXGEN Cvxgen, QPgen qpgen, and OSQP osqpgen projects. In these projects, code is generated for the specified problem class. However, the problem dimension and sometimes the underlying sparsity pattern of the data need to be fixed. Thus, the size of the generated code still grows with a growing problem dimension. All these projects are targeted at embedded systems and are optimized for small or sparse problems. The underlying solvers are based on Newton-type methods that solve a Newton system of equations by direct methods. Solving these systems is efficient only for small problems or problems where the sparsity structure of the Hessian can be exploited in the Cholesky factorization. Neither condition is typically met in standard machine learning problems.
Deep learning frameworks like TensorFlow tf , PyTorch pytorch , or Caffe caffe
are efficient and fairly flexible. However, they target only deep learning problems, which are typically unconstrained problems that ask to optimize a separable sum of loss functions. Algorithmically, deep learning frameworks usually employ some form of stochastic gradient descent (SGD) robbins1951, the rationale being that computing the full gradient is too slow and actually not necessary. A drawback of SGD-type algorithms is that they need careful parameter tuning of, for instance, the learning rate or, for accelerated SGD, the momentum. Parameter tuning is a time-consuming and often data-dependent task. A careless choice of these parameters can make the algorithm slow or even cause it to diverge. Also, SGD-type algorithms cannot handle constraints.

GENO, the framework that we present here, differs from the standard modeling-language-plus-solver approach by a much tighter coupling of the language and the solver. GENO does not transform problem instances but whole problem classes, including constrained problems, into a very general standard form. Since the standard form is independent of any specific problem instance, it does not grow for larger instances. GENO does not require the user to tune parameters, and the generated code is highly efficient.
|  | handwritten solver | TensorFlow, PyTorch | Weka, scikit-learn | CVXPY | GENO |
|---|---|---|---|---|---|
| flexible | ✗ | ✓ | ✗ | ✓ | ✓ |
| efficient | ✓ | ✓ | ✓ | ✗ | ✓ |
| deployable / standalone | ✓ | ✗ | ✗ | ✗ | ✓ |
| can accommodate constraints | ✓ | ✗ | ✓ | ✓ | ✓ |
| parameter-free (learning rate, …) | ✗/✓ | ✗ | ✓ | ✓ | ✓ |
| allows non-convex problems | ✓ | ✓ | ✓ | ✗ | ✓ |
2 The GENO Pipeline
GENO features a modeling language and a solver that are tightly coupled. The modeling language allows the user to specify a whole class of optimization problems in terms of an objective function and constraints that are given as vectorized linear algebra expressions. Neither the objective function nor the constraints need to be differentiable. Non-differentiable problems are transformed into constrained, differentiable problems. A general-purpose solver for constrained, differentiable problems is then instantiated with the objective function, the constraint functions, and their respective gradients. The gradients are computed by the matrix calculus algorithm that has recently been published in LaueMG18. The tight integration of the modeling language and the solver is possible only because of this recent progress in computing derivatives of vectorized linear algebra expressions.
Generating a solver takes only a few milliseconds. Once generated, the solver can be used like any handwritten solver for every instance of the specified problem class. An online interface to the GENO framework can be found at http://www.geno-project.org.
2.1 Modeling Language
A GENO specification has four blocks (cf. the example below, which shows an $\ell_1$-norm minimization problem from compressed sensing, where the signal is known to be an element of the unit simplex): (1) declaration of the problem parameters, which can be of type Matrix, Vector, or Scalar; (2) declaration of the optimization variables, which can also be of type Matrix, Vector, or Scalar; (3) specification of the objective function in a MATLAB-like syntax; and finally (4) specification of the constraints, also in a MATLAB-like syntax that supports the following operators and functions: +, -, *, /, .*, ./, ^, .^, log, exp, sin, cos, tanh, abs, norm1, norm2, sum, tr, det, inv. The set of operators and functions can be expanded when needed.
parameters
  Matrix A
  Vector b
variables
  Vector x
min
  norm1(x)
st
  A*x == b
  sum(x) == 1
  x >= 0
Note that, in contrast to instance-based modeling languages like CVXPY, no dimensions have to be specified. Also, the specified problems do not need to be convex. In the non-convex case, only a locally optimal solution will be computed.
2.2 Generic Optimizer
At its core, GENO’s generic optimizer is a solver for unconstrained, smooth optimization problems. This solver is then extended to also handle non-smooth and constrained problems. In the following, we first describe the smooth, unconstrained solver before we detail how it is extended to handle non-smooth and constrained optimization problems.
Solver for unconstrained, smooth problems.
There exist quite a number of algorithms for unconstrained optimization. Since our approach targets problems with a few dozen up to a few million variables, we decided to build on a first-order method. This still leaves many options. Nesterov’s method Nesterov83 has an optimal theoretical running time, that is, its asymptotic running time matches the lower bounds of $O(\sqrt{L/\varepsilon})$ iterations in the smooth, convex case and $O(\sqrt{L/m}\,\log(1/\varepsilon))$ iterations in the strongly convex case, with optimal dependence on the constants $L$ and $m$ that have to be known in advance. Here, $L$ and $m$ are upper and lower bounds, respectively, on the eigenvalues of the Hessian. On quadratic problems, quasi-Newton methods share the same optimal convergence guarantee huang70; Nazareth79 without requiring the values of these parameters. In practice, quasi-Newton methods often outperform Nesterov’s method, although they cannot beat it in the worst case. It is important to keep in mind that theoretical running time guarantees do not always translate into good performance in practice. For instance, even the simple subgradient method comes with a convergence guarantee on strongly convex problems Goffin77, but it is certainly not competitive on real-world problems.

Hence, we settled on a quasi-Newton method and implemented the well-established L-BFGS-B algorithm ByrdLNZ95; ZhuBLN97, which can also handle box constraints on the variables. It serves as the solver for unconstrained, smooth problems. The algorithm combines the standard limited-memory quasi-Newton method with a projected gradient path approach. In each iteration, the gradient path is projected onto the box constraints, and the quadratic function based on the second-order approximation (L-BFGS) of the Hessian is minimized along this path. All variables that are at their boundaries are fixed, and only the remaining free variables are optimized using the second-order approximation. Any solution that is not within the bound constraints is projected back onto the feasible set by a simple min/max operation MoralesN11. Only in rare cases does a projected point not form a descent direction. In this case, instead of using the projected point, one picks the best point that is still feasible along the ray towards the solution of the quadratic approximation. Then, a line search is performed to satisfy the strong Wolfe conditions Wolfe69; Wolfe71. These conditions are necessary for ensuring convergence also in the non-convex case. The line search also obviates the need for a step length or learning rate that is usually necessary in SGD, subgradient algorithms, or Nesterov’s method.
Here, we use the line search proposed in MoreT94, which we enhanced by a simple backtracking line search for the case that the solver enters a region where the function is not defined.
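The role of such a box-constrained quasi-Newton inner solver can be illustrated with SciPy's L-BFGS-B implementation. This is a minimal sketch of ours using `scipy.optimize.minimize`, not GENO's own solver; the quadratic test function is an illustrative choice:

```python
import numpy as np
from scipy.optimize import minimize

def objective(x, c):
    """Smooth test objective: squared distance to a target point c."""
    return np.sum((x - c) ** 2)

def gradient(x, c):
    return 2.0 * (x - c)

c = np.array([-0.5, 0.3, 1.7])      # unconstrained minimizer lies outside the box
bounds = [(0.0, 1.0)] * 3           # box constraints 0 <= x_i <= 1

res = minimize(objective, x0=np.zeros(3), args=(c,), jac=gradient,
               method="L-BFGS-B", bounds=bounds)
x_opt = res.x
```

For this separable quadratic, the box-constrained minimizer is simply the projection of `c` onto the box, which makes the result easy to verify.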
Solver for unconstrained nonsmooth problems.
Machine learning often entails non-smooth optimization problems, for instance all problems that employ non-smooth regularizers like the $\ell_1$-norm. Proximal gradient methods are a general technique for addressing such problems Proximal15. Here, we pursue a different approach. All non-smooth convex optimization problems that are allowed by our modeling language can be written in terms of smooth functions combined with non-smooth building blocks such as absolute values and norms Nesterov05. This class is flexible enough to accommodate most of the non-smooth objective functions encountered in machine learning. All problems in this class can be transformed into constrained, smooth problems; for instance, $\min_x f(x) + \|g(x)\|_1$ with smooth $f$ and $g$ becomes

$$\min_{x,\,t}\; f(x) + \mathbf{1}^\top t \quad \text{s.t.} \quad -t \le g(x) \le t.$$
The transformed problems can then be solved by the solver for constrained, smooth optimization problems that we describe next.
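The $\ell_1$-case of such a transformation can be checked numerically: $\min_x \|x\|_1$ s.t. $Ax = b$ and its smooth epigraph form attain the same optimal value. The sketch below uses `scipy.optimize.linprog` as a stand-in reference solver on random illustrative data, not GENO's augmented Lagrangian machinery:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 5, 12
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)          # guarantees feasibility

# (a) classic split-variable LP: x = u - v with u, v >= 0, minimize 1^T (u + v)
res_a = linprog(np.ones(2 * n), A_eq=np.hstack([A, -A]), b_eq=b,
                bounds=[(0, None)] * (2 * n))

# (b) epigraph form: variables (x, t), minimize 1^T t
#     s.t. A x = b,  x - t <= 0,  -x - t <= 0
c_epi = np.concatenate([np.zeros(n), np.ones(n)])
I = np.eye(n)
A_ub = np.block([[I, -I], [-I, -I]])
b_ub = np.zeros(2 * n)
res_b = linprog(c_epi, A_eq=np.hstack([A, np.zeros((m, n))]), b_eq=b,
                A_ub=A_ub, b_ub=b_ub,
                bounds=[(None, None)] * n + [(0, None)] * n)
```

Both linear programs are exact reformulations of the non-smooth problem, so their optimal values agree up to solver tolerance.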
Solver for smooth constrained problems.
There are also quite a few options for solving smooth, constrained problems, among them projected gradient methods, the alternating direction method of multipliers (ADMM) boydADMM; Gabay1976; Glowinski1975, and the augmented Lagrangian approach Hestenes69; Powell69. For GENO, we decided to follow the augmented Lagrangian approach because it allows us to (re)use our solver for smooth, unconstrained problems directly. Also, the augmented Lagrangian approach is more generic than ADMM. All ADMM-type methods need a proximal operator that cannot be derived automatically from the problem specification, and a closed-form solution is sometimes not easy to compute. Typically, one uses standard duality theory for deriving the prox-operator. In Proximal15, prox-operators are tabulated for several functions.
The augmented Lagrangian method can be used for solving the following general standard form of an abstract constrained optimization problem

$$\min_{x}\; f(x) \quad \text{s.t.} \quad h(x) = 0, \;\; g(x) \le 0, \tag{1}$$

where $f\colon\mathbb{R}^n\to\mathbb{R}$, $h\colon\mathbb{R}^n\to\mathbb{R}^p$, and $g\colon\mathbb{R}^n\to\mathbb{R}^q$ are differentiable functions, and the equality and inequality constraints are understood componentwise.
The augmented Lagrangian of Problem (1) is the following function

$$L_\rho(x, \lambda, \mu) = f(x) + \frac{\rho}{2}\,\Big\|h(x) + \frac{\lambda}{\rho}\Big\|^2 + \frac{\rho}{2}\,\Big\|\Big(g(x) + \frac{\mu}{\rho}\Big)^+\Big\|^2,$$

where $\lambda$ and $\mu \ge 0$ are Lagrange multipliers, $\rho > 0$ is a constant, $\|\cdot\|$ denotes the Euclidean norm, and $v^+$ denotes $\max(v, 0)$ componentwise. The Lagrange multipliers are also referred to as dual variables. In principle, the augmented Lagrangian is the standard Lagrangian of Problem (1) augmented with a quadratic penalty term. This term provides increased stability during the optimization process, which can be seen, for example, in the case that Problem (1) is a linear program. Note that whenever Problem (1) is convex, i.e., $h$ is affine and $f$ and $g$ are convex in each component, then the augmented Lagrangian is also a convex function.
The augmented Lagrangian method (Algorithm 1) runs in iterations. In each iteration, it solves an unconstrained smooth optimization problem. Upon convergence, it returns an approximate solution to the original problem along with approximate Lagrange multipliers for the dual problem. If Problem (1) is convex, then the algorithm returns a globally optimal solution; otherwise, it returns a local optimum Bertsekas99. The update of the penalty parameter $\rho$ can be skipped and the algorithm still converges Bertsekas99. However, in practice it is beneficial to increase it depending on the progress in satisfying the constraints Birgin14. If the infinity norm of the constraint violation does not decrease by at least a constant factor in one iteration, then $\rho$ is multiplied by a factor of two.
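The outer loop can be sketched as follows. This is a simplified illustration of ours restricted to equality constraints, with SciPy's L-BFGS-B as the inner solver and a penalty-doubling rule in the spirit described above; the toy test problem is our own choice, not from the paper:

```python
import numpy as np
from scipy.optimize import minimize

def augmented_lagrangian(f, h, x0, rho=1.0, iters=20, tol=1e-6):
    """Minimize f(x) subject to h(x) = 0 (componentwise) with an
    augmented Lagrangian outer loop and L-BFGS-B as inner solver."""
    x = np.asarray(x0, dtype=float)
    lam = np.zeros_like(np.atleast_1d(h(x)))
    prev_viol = np.inf
    for _ in range(iters):
        def subproblem(z):
            hz = np.atleast_1d(h(z))
            return f(z) + 0.5 * rho * np.sum((hz + lam / rho) ** 2)
        x = minimize(subproblem, x, method="L-BFGS-B").x
        hx = np.atleast_1d(h(x))
        viol = np.max(np.abs(hx))      # infinity norm of constraint violation
        lam = lam + rho * hx           # multiplier update
        if viol > 0.5 * prev_viol:     # too little progress: double the penalty
            rho *= 2.0
        prev_viol = viol
        if viol < tol:
            break
    return x

# toy problem: min (x1 - 1)^2 + (x2 - 2)^2  s.t.  x1 + x2 = 1
f = lambda z: (z[0] - 1.0) ** 2 + (z[1] - 2.0) ** 2
h = lambda z: np.array([z[0] + z[1] - 1.0])
x_opt = augmented_lagrangian(f, h, np.zeros(2))
```

For this toy problem, the KKT conditions give the constrained minimizer $(0, 1)$, which the loop approaches as the multiplier estimate converges.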
3 Limitations
While GENO is very general and efficient, as we demonstrate in the experimental Section 4, it also has some limitations that we discuss here. For small problems, i.e., problems with only a few dozen variables, Newton-type methods with a direct solver for the Newton system can be even faster. GENO also does not target deep learning applications, where gradients do not need to be computed fully but can be sampled.
Some problems can pose numerical difficulties; for instance, problems containing an exp operator might cause an overflow or underflow. However, this is a problem that is faced by all frameworks. It is usually addressed by introducing special operators like logsumexp.
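The overflow issue and its standard remedy can be demonstrated in a few lines (an illustrative sketch using SciPy's `logsumexp`; the input values are arbitrary):

```python
import numpy as np
from scipy.special import logsumexp

x = np.array([1000.0, 1000.0])

# naive evaluation: exp(1000) overflows to inf, so the result is inf
with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(x)))

# shifted evaluation: log(sum(exp(x))) = max(x) + log(sum(exp(x - max(x))))
stable = logsumexp(x)   # equals 1000 + log(2)
```

The stable variant subtracts the maximum before exponentiating, which keeps all intermediate values in a representable range.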
Furthermore, GENO does not perform sanity checks on the provided input. Any syntactically correct problem specification is accepted by GENO as valid input. For example, $\log(\det(xx^\top))$, where $x$ is a vector, is a valid expression. But the determinant of the outer product $xx^\top$ will always be zero, and hence, taking the logarithm will fail. It lies within the responsibility of the user to make sure that expressions are mathematically valid.
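The failure mode is easy to reproduce in NumPy (an illustrative sketch; the vector values are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
outer = np.outer(x, x)        # rank-1 matrix, so its determinant is zero

d = np.linalg.det(outer)
with np.errstate(divide="ignore"):
    value = np.log(d)         # log(0) yields -inf: the expression is
                              # syntactically valid but mathematically not
```

Any symbolic framework accepting this expression would produce the same degenerate result at evaluation time.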
4 Experiments
We conducted a number of experiments to show the wide applicability and efficiency of our approach. For the experiments, we have chosen classical problems that come with established, well-engineered solvers, like logistic regression or elastic net regression, but also problems and algorithms that have been published at NeurIPS and ICML only within the last few years. The experiments cover smooth unconstrained problems as well as constrained and non-smooth problems. To prevent a bias towards GENO, we always used the original code for the competing methods and followed the experimental setup in the papers where these methods have been introduced. We ran the experiments on standard data sets from the LIBSVM data set repository and, in some cases, on synthetic data sets on which competing methods had been evaluated in the corresponding papers.
Specifically, our experiments cover the following problems and solvers: $\ell_1$- and $\ell_2$-regularized logistic regression, support vector machines, elastic net regression, non-negative least squares, symmetric non-negative matrix factorization, problems from non-convex optimization, and compressed sensing. Among other algorithms, we compared against a trust-region Newton method with conjugate gradient descent for solving the Newton system, sequential minimal optimization (SMO), dual coordinate descent, proximal methods including ADMM and variants thereof, interior point methods, accelerated and variance-reduced variants of SGD, and Nesterov’s optimal gradient descent. Please refer to the appendix for more details on the solvers and GENO models.
Our test machine was equipped with an eight-core Intel Xeon E5-2643 CPU and 256 GB RAM. As software environment, we used Python 3.6 along with NumPy 1.16, SciPy 1.2, and scikit-learn 0.20. In some cases, the original code of competing methods was written and run in MATLAB R2019. The solvers generated by GENO spent most of their time on evaluating function values and gradients, and these evaluations essentially reduce to evaluating linear algebra expressions. Since all libraries are linked against the Intel MKL, running times of the GENO solvers are essentially the same in both environments, Python and MATLAB.
4.1 Regularization Path for $\ell_1$-regularized Logistic Regression
Computing the regularization path of the $\ell_1$-regularized logistic regression problem Cox58 is a classical machine learning problem, and only seemingly boring at first glance. The problem is well suited for demonstrating the importance of both aspects of our approach, namely flexibility and efficiency. As a standard problem, it is covered in scikit-learn. The scikit-learn implementation features the SAGA algorithm DefazioBL14 for computing the whole regularization path that is shown in Figure 1. This figure can also be found on the scikit-learn website (https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic_path.html). However, when using GENO, the regularization path looks different, see also Figure 1. Checking the objective function values reveals that the precision of the SAGA algorithm is not sufficient for tracking the path faithfully. GENO’s result can be reproduced by using CVXPY, except for one outlier at which CVXPY did not compute the optimal solution. LIBLINEAR FanCHWL08; ZhuangJYL18 can also be used for computing the regularization path, but it also fails to follow the exact path. This can be explained as follows: LIBLINEAR does not compute optimal solutions either, but, more importantly, in contrast to the original formulation, it penalizes the bias term for algorithmic reasons. Thus, changing the problem slightly can lead to fairly different results.

CVXPY, like GENO, is flexible and precise enough to accommodate the original problem formulation and to closely track the regularization path. But it is not as efficient as GENO. On the problem used in Figure 1, SAGA takes 4.3 seconds, the GENO solver takes 0.5 seconds, CVXPY takes 13.5 seconds, and LIBLINEAR takes 0.05 seconds, but for a slightly different problem and with insufficient accuracy.
4.2 $\ell_2$-regularized Logistic Regression
Logistic regression is probably the most popular linear, binary classification method. It is given by the following unconstrained optimization problem with a smooth objective function

$$\min_{w}\; \sum_{i=1}^{m} \log\big(1 + \exp(-y_i \cdot (Xw)_i)\big) + \frac{\lambda}{2}\|w\|_2^2,$$

where $X \in \mathbb{R}^{m\times n}$ is a data matrix, $y \in \{-1, +1\}^m$ is a label vector, and $\lambda$ is the regularization parameter. Since it is a classical problem, there exist many well-engineered solvers for $\ell_2$-regularized logistic regression. The problem also serves as a testbed for new algorithms. We compared GENO to the parallel version of LIBLINEAR and a number of recently developed algorithms and their implementations, namely Point-SAGA Defazio16, SDCA ShalevShwartz13, and catalyst SDCA LinMH15. The latter algorithms implement some form of SGD. Thus, their running time heavily depends on the values of the learning rate (step size) and, in the case of accelerated SGD, the momentum parameter. The best parameter setting often depends on the regularization parameter and the data set. We have used the code provided by Defazio16 and the parameter settings therein.
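A minimal sketch of this objective and its gradient, minimized with an off-the-shelf quasi-Newton method (we use SciPy's L-BFGS-B as a stand-in for GENO's generated solver; the synthetic data and the value of $\lambda$ are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def logreg_objective(w, X, y, lam):
    """l2-regularized logistic loss; labels y are in {-1, +1}."""
    margins = y * (X @ w)
    # log(1 + exp(-m)) computed stably via logaddexp
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * np.dot(w, w)

def logreg_gradient(w, X, y, lam):
    margins = y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(margins))       # = sigmoid(-margins)
    return -(X.T @ (y * sigma)) / len(y) + lam * w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
y = np.sign(X @ w_true + 0.1 * rng.standard_normal(200))

res = minimize(logreg_objective, np.zeros(5), args=(X, y, 0.1),
               jac=logreg_gradient, method="L-BFGS-B")
w_opt = res.x
```

Since the regularized objective is smooth and strongly convex, a near-zero gradient at the returned point certifies (approximate) global optimality.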
For our experiments, we fixed the regularization parameter and used real-world data sets that are commonly used in experiments involving logistic regression. GENO converges as rapidly as LIBLINEAR and outperforms any of the recently published solvers by a good margin, see Figure 2.
On substantially smaller data sets, we also compared GENO to CVXPY with both the ECOS solver Ecos1 and the SCS solver scs. As can be seen from Table 2, GENO is orders of magnitude faster.
| Solver | heart | ionosphere | breast-cancer | australian | diabetes | a1a | a5a |
|---|---|---|---|---|---|---|---|
| GENO | 0.005 | 0.013 | 0.004 | 0.014 | 0.006 | 0.023 | 0.062 |
| ECOS | 1.999 | 2.775 | 5.080 | 5.380 | 5.881 | 12.606 | 57.467 |
| SCS | 2.589 | 3.330 | 6.224 | 6.578 | 6.743 | 16.361 | 87.904 |
4.3 Support Vector Machines
Support Vector Machines (SVMs) CortesV95 have been studied intensively and are widely used, especially in combination with kernels SchoelkopfS01. They remain popular, as is indicated by the still-rising citation count of the popular and heavily-cited solver LIBSVM ChangL01. The dual formulation of an SVM is given as the following quadratic optimization problem

$$\min_{\alpha}\; \frac{1}{2}\,\alpha^\top (K \circ yy^\top)\,\alpha - \mathbf{1}^\top \alpha \quad \text{s.t.} \quad 0 \le \alpha \le C,$$

where $K$ is a kernel matrix, $y$ is a binary label vector, $C$ is the regularization parameter, and $\circ$ is the elementwise multiplication. While the SVM problem with a kernel can also be solved in the primal chapelle2007, it is traditionally solved in the dual. We use a Gaussian kernel, i.e., $K_{ij} = \exp(-\gamma\|x_i - x_j\|^2)$, and standard data sets. We set the bandwidth parameter $\gamma$ to a value that corresponds to roughly the median of the pairwise data point distances, and fixed the regularization parameter $C$. Table 3 shows that the solver generated by GENO is as efficient as LIBSVM, which has been maintained and improved over the last 15 years. Both solvers outperform general-purpose approaches like CVXPY with OSQP osqp, SCS scs, Gurobi gurobi, or Mosek mosek by a few orders of magnitude.
| Solver | ionosphere | australian | diabetes | a1a | a5a | a9a | w8a | cod-rna |
|---|---|---|---|---|---|---|---|---|
| GENO | 0.009 | 0.024 | 0.039 | 0.078 | 1.6 | 30.0 | 25.7 | 102.1 |
| LIBSVM | 0.005 | 0.010 | 0.009 | 0.088 | 1.0 | 18.0 | 78.6 | 193.1 |
| SCS | 0.442 | 1.461 | 3.416 | 11.707 | 517.5 | – | – | – |
| OSQP | 0.115 | 0.425 | 0.644 | 3.384 | 168.2 | – | – | – |
| Gurobi | 0.234 | 0.768 | 0.992 | 4.307 | 184.4 | – | – | – |
| Mosek | 0.378 | 0.957 | 1.213 | 6.254 | 152.7 | – | – | – |
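A box-constrained SVM dual of this shape can be sketched with a Gaussian kernel and L-BFGS-B. This is our own toy illustration of the bias-free dual, where the box constraints are the only constraints, and is not LIBSVM's exact formulation; data, $\gamma$, and $C$ are arbitrary:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(40))

gamma, C = 0.5, 1.0
K = np.exp(-gamma * cdist(X, X, "sqeuclidean"))   # Gaussian kernel matrix
Q = K * np.outer(y, y)                            # K composed elementwise with yy^T

def dual(alpha):
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

def dual_grad(alpha):
    return Q @ alpha - np.ones_like(alpha)

res = minimize(dual, np.zeros(40), jac=dual_grad, method="L-BFGS-B",
               bounds=[(0.0, C)] * 40)
alpha = res.x
```

The gradient at $\alpha = 0$ is $-\mathbf{1}$, so the optimal dual value is strictly negative, which gives a simple sanity check on the result.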
4.4 Elastic Net
Elastic net regression ZouH05 has also been studied intensively and is used mainly for microarray data classification and gene selection. Given some data $X \in \mathbb{R}^{m\times n}$ and a response $y \in \mathbb{R}^m$, elastic net regression seeks to minimize

$$\frac{1}{2}\|Xw - y\|_2^2 + \lambda_1 \|w\|_1 + \frac{\lambda_2}{2}\|w\|_2^2,$$

where $\lambda_1$ and $\lambda_2$ are the corresponding elastic net regularization parameters. The most popular solver is glmnet, a coordinate descent approach that has been implemented in Fortran FriedmanHT09. In our experiments, we follow the same setup as in FriedmanHT09. We generated Gaussian data $X$ with $m$ data points and $n$ features. The outcome values were generated by

$$y = \sum_{j} x_j \beta_j + k \cdot z,$$

where $\beta_j = (-1)^j \exp(-2(j-1)/20)$, $z \sim N(0, 1)$, and $k$ is chosen such that the signal-to-noise ratio is 3. We varied the number of data points $m$ and the number of features $n$. The results are shown in Table 4. It can be seen that the solver generated by GENO is as efficient as glmnet and orders of magnitude faster than comparable state-of-the-art general-purpose approaches like CVXPY coupled with ECOS, SCS, Gurobi, or Mosek. Note that the OSQP solver could not be run on this problem since CVXPY raised the error that it cannot convert this problem into a QP.
| m | n | GENO | glmnet | ECOS | SCS | Gurobi | Mosek |
|---|---|---|---|---|---|---|---|
| 1000 | 1000 | 0.11 | 0.10 | 43.27 | 2.33 | 21.14 | 1.77 |
| 2000 | 1000 | 0.14 | 0.08 | 202.04 | 9.24 | 58.44 | 3.52 |
| 3000 | 1000 | 0.18 | 0.08 | 513.78 | 22.86 | 114.79 | 5.38 |
| 4000 | 1000 | 0.21 | 0.09 | – | 38.90 | 185.79 | 7.15 |
| 5000 | 1000 | 0.27 | 0.11 | – | 13.88 | 151.08 | 8.69 |
| 1000 | 5000 | 1.74 | 0.62 | – | 28.69 | – | 13.06 |
| 2000 | 5000 | 1.49 | 1.41 | – | 45.79 | – | 27.69 |
| 3000 | 5000 | 1.58 | 2.02 | – | 81.83 | – | 50.99 |
| 4000 | 5000 | 1.24 | 1.88 | – | 135.94 | – | 67.60 |
| 5000 | 5000 | 1.41 | 1.99 | – | 166.60 | – | 71.92 |
| 5000 | 10000 | 4.11 | 4.75 | – | – | – | – |
| 7000 | 10000 | 4.76 | 5.52 | – | – | – | – |
| 10000 | 10000 | 4.66 | 3.89 | – | – | – | – |
| 50000 | 10000 | 13.97 | 6.34 | – | – | – | – |
| 70000 | 10000 | 18.82 | 11.76 | – | – | – | – |
| 100000 | 10000 | 23.38 | 23.42 | – | – | – | – |
4.5 Nonnegative Least Squares
Least squares is probably the most widely used regression method. Non-negative least squares is an extension that requires the coefficients to be non-negative. It is given as the following optimization problem

$$\min_{x}\; \frac{1}{2}\|Ax - b\|_2^2 \quad \text{s.t.} \quad x \ge 0,$$

where $A \in \mathbb{R}^{m\times n}$ is a given design matrix and $b \in \mathbb{R}^m$ is the response vector. Since non-negative least squares has been studied intensively, there is a plenitude of solvers available that implement different optimization methods. An overview and comparison of the different methods can be found in Slawski13. Here, we use the accompanying code described in Slawski13 for our comparison. We ran two sets of experiments, similar to the comparisons in Slawski13, where it was shown that the different algorithms behave quite differently on these problems. For experiment (i), we generated random data $A$, where the entries of $A$ were sampled uniformly at random, together with a sparse vector $x^*$ whose non-zero entries were also sampled from the uniform distribution. The outcome values were then generated by $b = Ax^* + \varepsilon$ with Gaussian noise $\varepsilon$. For experiment (ii), $A$ was drawn from a Gaussian distribution and $x^*$ was less sparse; the outcome variable was again generated by $b = Ax^* + \varepsilon$. The differences between the two experiments are the following: (1) the Gram matrix $A^\top A$ is singular in experiment (i) and regular in experiment (ii); (2) the design matrix $A$ has isotropic rows in experiment (ii), which does not hold for experiment (i); and (3) $x^*$ is significantly sparser in (i) than in (ii). We compared the solver generated by GENO with the following approaches: the classical Lawson-Hanson algorithm lawson95, which employs an active set strategy; a projected gradient descent algorithm combined with an Armijo-along-projection-arc line search (Bertsekas99, Ch. 2.3); a primal-dual interior point algorithm that uses a conjugate gradient method BoydV04 with a diagonal preconditioner for solving the Newton system; a subspace Barzilai-Borwein approach kim13; and Nesterov’s accelerated projected gradient descent Nesterov83. Figure 3 shows the results for both experiments. Note that the Barzilai-Borwein approach with standard parameter settings diverged on experiment (i) and stopped making progress on experiment (ii). While the other approaches vary in running time depending on the problem, the experiments show that the solver generated by GENO is always among the fastest.

We provide the final running times of the general-purpose solvers in Table 5, since obtaining intermediate solutions is not possible for these solvers. Table 5 also provides the function values of the individual solvers. It can be seen that, while the SCS solver is considerably faster than the ECOS solver, the solution computed by the SCS solver is not optimal in experiment (i). The ECOS solver provides a solution with the same accuracy as GENO, but at a running time that is a few orders of magnitude larger.
| m | n | | GENO | ECOS | SCS | Gurobi | Mosek |
|---|---|---|---|---|---|---|---|
| 2000 | 6000 | time | 4.8 | 689.7 | 70.4 | 187.3 | 24.9 |
| | | fval | 0.01306327 | 0.01306327 | 0.07116707 | 0.01306330 | 0.01306343 |
| 6000 | 3000 | time | 0.3 | 3751.3 | 275.5 | 492.9 | 58.4 |
| | | fval | 0.03999098 | 0.03999098 | 0.04000209 | 0.03999100 | 0.03999114 |
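The non-negative least squares problem can be cross-checked with two independent methods: SciPy's implementation of the Lawson-Hanson active-set algorithm and a bound-constrained quasi-Newton solve (an illustrative sketch of ours on small random data, not the benchmark setup above):

```python
import numpy as np
from scipy.optimize import minimize, nnls

rng = np.random.default_rng(0)
A = rng.random((30, 10))
b = A @ np.abs(rng.standard_normal(10)) + 0.01 * rng.standard_normal(30)

# (a) classical Lawson-Hanson active-set algorithm
x_as, _ = nnls(A, b)

# (b) the same problem treated as a bound-constrained smooth problem
def f(x):
    r = A @ x - b
    return 0.5 * r @ r

def grad(x):
    return A.T @ (A @ x - b)

x_qn = minimize(f, np.zeros(10), jac=grad, method="L-BFGS-B",
                bounds=[(0.0, None)] * 10).x
```

Since the problem is convex, both methods must reach the same optimal objective value, even if the iterates differ.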
4.6 Symmetric Nonnegative Matrix Factorization
Non-negative matrix factorization (NMF) and its many variants are standard methods for recommender systems adomavicius2005 and topic modeling blei2003; hofmann1999. It is known as symmetric NMF when both factor matrices are required to be identical. Symmetric NMF is used for clustering problems kuang2015 and is known to be equivalent to kernel k-means clustering ding2005. Given a target matrix $T$, symmetric NMF is given as the following optimization problem

$$\min_{W \ge 0}\; \|T - WW^\top\|_F^2,$$

where $W$ is a non-negative factor matrix of rank $r$. Note that the problem cannot be modeled and solved by CVXPY since it is non-convex. It has been addressed recently in ZhuLLL18 by two new methods. Both methods are symmetric variants of the alternating non-negative least squares (ANLS) kim2008 and the hierarchical ALS (HALS) cichocki2009 algorithms.
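The objective and its gradient can be sketched directly, with non-negativity handled as bound constraints (our own simplified baseline via L-BFGS-B, not the SymANLS/SymHALS algorithms; the exactly factorable target is an illustrative choice):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, r = 20, 3
W0 = np.abs(rng.standard_normal((n, r)))
T = W0 @ W0.T                                    # exactly factorable target

def f(w):
    W = w.reshape(n, r)
    R = T - W @ W.T
    return np.sum(R * R)                         # squared Frobenius norm

def grad(w):
    W = w.reshape(n, r)
    return (-4.0 * (T - W @ W.T) @ W).ravel()    # valid since T is symmetric

w_init = np.abs(rng.standard_normal(n * r))
res = minimize(f, w_init, jac=grad, method="L-BFGS-B",
               bounds=[(0.0, None)] * (n * r))
W = res.x.reshape(n, r)
```

Because the problem is non-convex, only a local minimum is guaranteed; the sanity checks below therefore only verify feasibility and a strict decrease of the objective.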
We compared GENO to both methods. For the comparison, we used the code and the same experimental setup as in ZhuLLL18. Random positive-semidefinite target matrices of different sizes were computed from random matrices with absolute-value Gaussian entries. As can be seen in Figure 4, GENO outperforms both methods (SymANLS and SymHALS) by a large margin.
4.7 Nonlinear Least Squares
GENO makes use of a quasi-Newton solver which approximates the Hessian by the weighted sum of the identity matrix and a positive semidefinite, low-rank matrix. One could assume that this does not work well in the case that the true Hessian is indefinite, i.e., in the non-convex case. Hence, we also conducted some experiments on non-convex problems. We followed the same setup and ran the same experiments as in LiuLWYY18 and compared to state-of-the-art solvers that were specifically designed to cope with non-convex problems. In particular, we considered the non-linear least squares problem, i.e., we seek to minimize the function $\|\sigma(Aw) - y\|_2^2$, where $A$ is a data matrix, $y$ is a binary label vector, and $\sigma$ is the sigmoid function. Figure 5 shows the convergence speed for the data sets w1a and a1a. The state-of-the-art specialized solvers that were introduced in LiuLWYY18 are S-AdaNCG, a stochastic adaptive negative curvature and gradient algorithm, and AdaNCD-SCSG, an adaptive negative curvature descent algorithm that uses SCSG lei2017 as a subroutine. The experiments show that GENO outperforms both algorithms by a large margin. In fact, on the data set a1a, both algorithms did not converge to the optimal solution with standard parameter settings. Again, this problem cannot be modeled and solved by CVXPY.

4.8 Compressed Sensing
In compressed sensing, one tries to recover a sparse signal from a small number of measurements CandesT06; donoho2006. See the recent survey Rani18 for an overview of this topic. The problem can be reduced to finding the solution to an underdetermined system of linear equations with minimal $\ell_1$-norm. Hence, it can be written as the following optimization problem

$$\min_{x}\; \|x\|_1 \quad \text{s.t.} \quad Ax = b, \tag{2}$$

where $A \in \mathbb{R}^{m\times n}$ is a measurement matrix and $b \in \mathbb{R}^m$ is the vector of measurements. Note that this problem is a constrained problem with a non-differentiable objective function. It is known that, when the matrix $A$ has the restricted isometry property and the true signal is sparse, then Problem (2) recovers the true signal with high probability if the dimensions $m$ and $n$ are chosen properly candes2005. Considerable progress has been made in designing algorithms that come with convergence guarantees chin2013; christiano2011. Very recently, in Ene19, a new and efficient algorithm based on the iterative reweighted least squares (IRLS) technique has been proposed. Compared to previous approaches, their algorithm is simple and achieves the state-of-the-art convergence guarantees for this problem.
We used the same setup and random data set as in Ene19 and ran the same experiment. The measurement matrix $A$ was generated randomly such that all rows are orthogonal. Then, a sparse signal $x^*$ with only 15 non-zero entries was chosen, and the corresponding measurement vector was computed via $b = Ax^*$. We compared to their IRLS algorithm with the long-steps update scheme. Figure 6 shows the convergence speed towards the optimal function value as well as the convergence towards feasibility. It can be seen that the solver generated by GENO outperforms the specialized, state-of-the-art IRLS solver by a few orders of magnitude.
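Problem (2) can be sketched on a much smaller scale with the classical split-variable LP reformulation (we use `scipy.optimize.linprog` as a stand-in solver, not the IRLS algorithm or GENO; dimensions and sparsity are illustrative, and the data generation only mimics the orthogonal-row setup described above):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, k = 40, 100, 5
# rows of A made orthonormal via a QR decomposition
A = np.linalg.qr(rng.standard_normal((n, m)))[0].T
x_true = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_true[support] = rng.standard_normal(k)
b = A @ x_true

# min ||x||_1 s.t. Ax = b, written as an LP with x = u - v, u, v >= 0
res = linprog(np.ones(2 * n), A_eq=np.hstack([A, -A]), b_eq=b,
              bounds=[(0, None)] * (2 * n))
x_hat = res.x[:n] - res.x[n:]
```

Since $x^*$ is feasible by construction, the optimal value can never exceed $\|x^*\|_1$, which gives a simple correctness check independent of whether exact recovery occurs.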
5 Conclusions
While other fields of applied research that rely heavily on optimization, like operations research, optimal control, and deep learning, have adopted optimization frameworks, this is not the case for classical machine learning. Instead, classical machine learning methods are still mostly accessed through toolboxes like scikit-learn, Weka, or MLlib. These toolboxes provide well-engineered solutions for many of the standard problems, but lack the flexibility to adapt the underlying models when necessary. We attribute this state of affairs to the common belief that efficient optimization for classical machine learning needs to exploit the problem structure. Here, we have challenged this belief. We have presented GENO, the first general-purpose optimization framework for problems from classical machine learning. GENO combines an easy-to-read modeling language with a general-purpose solver. Experiments on a variety of problems from classical machine learning demonstrate that GENO is as efficient as established, well-engineered solvers and often outperforms recently published state-of-the-art solvers by a good margin. It is as flexible as state-of-the-art modeling language plus solver frameworks, but outperforms them by a few orders of magnitude.
Acknowledgments
Sören Laue has been funded by Deutsche Forschungsgemeinschaft (DFG) under grant LA 2971/11.
References
 [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In USENIX Conference on Operating Systems Design and Implementation (OSDI), pages 265–283, 2016.
 [2] Gediminas Adomavicius and Alexander Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge & Data Engineering, (6):734–749, 2005.
 [3] Akshay Agrawal, Robin Verschueren, Steven Diamond, and Stephen Boyd. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42–60, 2018.
 [4] Goran Banjac, Bartolomeo Stellato, Nicholas Moehle, Paul Goulart, Alberto Bemporad, and Stephen P. Boyd. Embedded code generation using the OSQP solver. In Conference on Decision and Control, (CDC), pages 1906–1911, 2017.
 [5] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999.
 [6] Ernesto G. Birgin and José Mario Martínez. Practical augmented Lagrangian methods for constrained optimization, volume 10 of Fundamentals of Algorithms. SIAM, 2014.
 [7] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
 [8] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
 [9] Stephen P. Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
 [10] A. Brooke, D. Kendrick, and A. Meeraus. GAMS: release 2.25 : a user’s guide. The Scientific press series. Scientific Press, 1992.
 [11] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM J. Scientific Computing, 16(5):1190–1208, 1995.
 [12] Emmanuel J Candès and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
 [13] Emmanuel J. Candès and Terence Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Information Theory, 52(12):5406–5425, 2006.
 [14] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
 [15] Olivier Chapelle. Training a support vector machine in the primal. Neural Computation, 19(5):1155–1178, 2007.
 [16] Hui Han Chin, Aleksander Madry, Gary L. Miller, and Richard Peng. Runtime guarantees for regression problems. In Conference on Innovations in Theoretical Computer Science (ITCS), pages 269–282, 2013.
 [17] Paul Christiano, Jonathan A. Kelner, Aleksander Madry, Daniel A. Spielman, and Shang-Hua Teng. Electrical flows, Laplacian systems, and faster approximation of maximum flow in undirected graphs. In ACM Symposium on Theory of Computing (STOC), pages 273–282, 2011.
 [18] Andrzej Cichocki and Anh-Huy Phan. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 92(3):708–721, 2009.
 [19] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
 [20] David R. Cox. The regression analysis of binary sequences (with discussion). J. Roy. Stat. Soc. B, 20:215–242, 1958.
 [21] CVX Research, Inc. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, December 2018.
 [22] Aaron Defazio. A simple practical accelerated method for finite sums. In Advances in Neural Information Processing Systems (NIPS), pages 676–684, 2016.
 [23] Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), pages 1646–1654, 2014.
 [24] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
 [25] Chris Ding, Xiaofeng He, and Horst D. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In SIAM International Conference on Data Mining (SDM), pages 606–610, 2005.
 [26] Alexander Domahidi, Eric Chu, and Stephen P. Boyd. ECOS: An SOCP solver for embedded systems. In European Control Conference (ECC), 2013.
 [27] David Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
 [28] Iain Dunning, Joey Huchette, and Miles Lubin. JuMP: A modeling language for mathematical optimization. SIAM Review, 59(2):295–320, 2017.
 [29] Alina Ene and Adrian Vladu. Improved convergence for $\ell_1$ and $\ell_\infty$ regression via iteratively reweighted least squares. In International Conference on Machine Learning (ICML), 2019. To appear.
 [30] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
 [31] Robert Fourer, David M. Gay, and Brian W. Kernighan. AMPL: A modeling language for mathematical programming. Thomson/Brooks/Cole, 2003.
 [32] Eibe Frank, Mark A. Hall, and Ian H. Witten. The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques". Morgan Kaufmann, fourth edition, 2016.
 [33] Jerome H. Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
 [34] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
 [35] P. Giselsson and S. Boyd. Linear convergence and metric selection for Douglas-Rachford splitting and ADMM. IEEE Transactions on Automatic Control, 62(2):532–544, Feb 2017.
 [36] R. Glowinski and A. Marroco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires. ESAIM: Mathematical Modelling and Numerical Analysis - Modélisation Mathématique et Analyse Numérique, 9(R2):41–76, 1975.
 [37] Jean-Louis Goffin. On convergence rates of subgradient optimization methods. Math. Program., 13(1):329–347, 1977.
 [38] M. Grant and S. Boyd. Graph implementations for nonsmooth convex programs. In V. Blondel, S. Boyd, and H. Kimura, editors, Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pages 95–110. 2008.
 [39] Gurobi Optimization, Inc. Gurobi Optimizer. http://www.gurobi.com.
 [40] Magnus R. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4(5):303–320, 1969.
 [41] Thomas Hofmann. Probabilistic latent semantic analysis. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 289–296, 1999.
 [42] Ho-Yi Huang. Unified approach to quadratically convergent algorithms for function minimization. Journal of Optimization Theory and Applications, 5(6):405–423, 1970.
 [43] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
 [44] Dongmin Kim, Suvrit Sra, and Inderjit S Dhillon. A nonmonotonic method for largescale nonnegative least squares. Optimization Methods and Software, 28(5):1012–1039, 2013.
 [45] Jingu Kim and Haesun Park. Toward faster nonnegative matrix factorization: A new algorithm and comparisons. In IEEE International Conference on Data Mining (ICDM), pages 353–362, 2008.
 [46] Da Kuang, Sangwoon Yun, and Haesun Park. SymNMF: Nonnegative low-rank approximation of a similarity matrix for graph clustering. Journal of Global Optimization, 62(3):545–574, 2015.
 [47] Sören Laue, Matthias Mitterreiter, and Joachim Giesen. Computing higher order derivatives of matrix and tensor expressions. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
 [48] Charles L. Lawson and Richard J. Hanson. Solving least squares problems. Classics in Applied Mathematics. SIAM, 1995.
 [49] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Nonconvex finitesum optimization via scsg methods. In Advances in Neural Information Processing Systems, pages 2348–2358, 2017.
 [50] Hongzhou Lin, Julien Mairal, and Zaïd Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (NIPS), pages 3384–3392, 2015.
 [51] Mingrui Liu, Zhe Li, Xiaoyu Wang, Jinfeng Yi, and Tianbao Yang. Adaptive negative curvature descent with applications in nonconvex optimization. In Advances in Neural Information Processing Systems (NeurIPS), pages 4858–4867, 2018.
 [52] Jacob Mattingley and Stephen Boyd. CVXGEN: A Code Generator for Embedded Convex Optimization. Optimization and Engineering, 13(1):1–27, 2012.
 [53] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research, 17(1), January 2016.
 [54] José Luis Morales and Jorge Nocedal. Remark on "Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization". ACM Trans. Math. Softw., 38(1):7:1–7:4, 2011.
 [55] Jorge J. Moré and David J. Thuente. Line search algorithms with guaranteed sufficient decrease. ACM Trans. Math. Softw., 20(3):286–307, 1994.
 [56] MOSEK ApS. The MOSEK optimization software. http://www.mosek.com.
 [57] L. Nazareth. A relationship between the bfgs and conjugate gradient algorithms and its implications for new algorithms. SIAM Journal on Numerical Analysis, 16(5):794–800, 1979.
 [58] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN USSR (translated as Soviet Math. Docl.), 269, 1983.
 [59] Yurii Nesterov. Smooth minimization of nonsmooth functions. Math. Program., 103(1):127–152, 2005.
 [60] Brendan O’Donoghue, Eric Chu, Neal Parikh, and Stephen Boyd. Conic optimization via operator splitting and homogeneous selfdual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, 2016.
 [61] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
 [62] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikitlearn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 [63] Nicholas G. Polson, James G. Scott, and Brandon T. Willard. Proximal algorithms in statistics and machine learning. arXiv preprint, May 2015.
 [64] M. J. D. Powell. Algorithms for nonlinear constraints that use Lagrangian functions. Mathematical Programming, 14(1):224–248, 1969.
 [65] Meenu Rani, S. B. Dhok, and R. B. Deshmukh. A systematic review of compressive sensing: Concepts, implementations and applications. IEEE Access, 6:4875–4894, 2018.
 [66] Herbert Robbins and Sutton Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, 1951.
 [67] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.
 [68] Shai ShalevShwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
 [69] Martin Slawski. Problemspecific analysis of nonnegative least squares solvers with a focus on instances with sparse solutions. https://sites.google.com/site/slawskimartin/code, March 2013.
 [70] B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd. OSQP: An operator splitting solver for quadratic programs. arXiv preprint, November 2017.
 [71] P. Wolfe. Convergence conditions for ascent methods. SIAM Review, 11(2):226–235, 1969.
 [72] P. Wolfe. Convergence conditions for ascent methods. ii: Some corrections. SIAM Review, 13(2):185–188, 1971.
 [73] Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw., 23(4):550–560, 1997.
 [74] Zhihui Zhu, Xiao Li, Kai Liu, and Qiuwei Li. Dropping symmetry for fast symmetric nonnegative matrix factorization. In Advances in Neural Information Processing Systems (NeurIPS), pages 5160–5170, 2018.
 [75] Yong Zhuang, Yu-Chin Juan, Guo-Xun Yuan, and Chih-Jen Lin. Naive parallelization of coordinate descent methods and an application on multi-core L1-regularized classification. In International Conference on Information and Knowledge Management (CIKM), pages 1103–1112, 2018.
 [76] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.
Appendix
Appendix A GENO Models for all Experiments
Appendix B Summary of Solvers
Name | Type | Reference
GENO | quasi-Newton w/ augmented Lagrangian | this paper
OSQP | ADMM | [70]
SCS | ADMM | [60]
ECOS | interior point | [26]
Gurobi | interior point | [39]
Mosek | interior point | [56]
LIBSVM | Frank-Wolfe (SMO) | [14]
LIBLINEAR | conjugate gradient + dual coordinate descent | [30, 75]
SAGA | SGD | [23]
SDCA | SGD | [68]
catalyst SDCA | SGD | [50]
Point-SAGA | SGD | [22]
glmnet | dual coordinate descent | [33]
Lawson-Hanson | direct linear equations solver w/ active set | [48]
projected gradient descent | proximal algorithm | [69]
primal-dual interior point w/ preconditioned conjugate gradient | interior point | [69]
subspace Barzilai-Borwein | quasi-Newton | [44]
Nesterov's method | accelerated gradient descent | [58]
SymANLS | block coordinate descent | [74]
SymHALS | block coordinate descent | [74]
S-AdaNCG | SGD | [51]
AdaNCD-SCSG | SGD | [51]
IRLS | IRLS w/ conjugate gradient method | [29]