1 License and Citations
Cyanure is distributed under the BSD-3-Clause license. Even though this is not legally binding, the author kindly asks users to cite the present arXiv document in their publications, as well as the publication related to the algorithm they have chosen (see Section 4 for the related publications). If the default solver is used, the publication is most likely to be (Lin et al., 2019).
2 Main Features
Cyanure is built upon several goals and principles:

Cyanure is memory efficient. If Cyanure accepts your dataset, it will never make a copy of it. Matrices can be provided in double or single precision. Sparse matrices (scipy/CSR format for Python, CSC for C++) can be provided with integers coded in 32 or 64 bits. When fitting an intercept, there is no need to add a column of 1's, and no matrix copy is made either.
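To make the data-format claims concrete, here is a minimal sketch with synthetic data; the BinaryClassifier class and fit arguments are those used in Section 5, but whether label and feature dtypes must match exactly is an assumption:

```python
import numpy as np
import scipy.sparse
import cyanure as ars

rng = np.random.RandomState(0)
n, p = 10000, 100

# dense features in single precision; Cyanure should use them without copying
X_dense = rng.randn(n, p).astype(np.float32)
y = np.where(rng.randn(n) > 0, 1.0, -1.0)   # binary labels in {-1, +1}

# sparse features in scipy/CSR format, with 64-bit integer indices
X_sparse = scipy.sparse.random(n, p, density=0.01, format='csr', dtype=np.float64)
X_sparse.indices = X_sparse.indices.astype(np.int64)
X_sparse.indptr = X_sparse.indptr.astype(np.int64)

clf = ars.BinaryClassifier(loss='logistic', penalty='l2')
clf.fit(X_dense, y.astype(np.float32), lambd=0.1 / n, nepochs=100, tol=1e-3)
clf.fit(X_sparse, y, lambd=0.1 / n, nepochs=100, tol=1e-3)
```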

Cyanure implements fast algorithms. Cyanure builds upon two algorithmic principles: (i) variance-reduced stochastic optimization; (ii) Nesterov or Quasi-Newton acceleration. Variance-reduced stochastic optimization algorithms are now popular, but they tend to perform poorly when the objective function is badly conditioned. We observe large gains when combining these approaches with Quasi-Newton methods.

Cyanure only depends on your BLAS implementation. Cyanure does not depend on external libraries, except a BLAS library and numpy for Python. We show how to link with OpenBLAS and Intel MKL in the Python package, but any other BLAS implementation will do.

Cyanure can handle many combinations of loss and regularization functions. Cyanure can handle a vast combination of loss functions (logistic, square, squared hinge, multiclass logistic) with regularization functions ($\ell_1$, $\ell_2$, elastic-net, fused lasso, multi-task group lasso).

Cyanure provides optimization guarantees. We believe that reproducibility is important in research. For this reason, knowing whether you have actually solved your problem when the algorithm stops is important. Cyanure provides such a guarantee through a duality-gap mechanism (see Appendix D.2 of Mairal, 2010).
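In practice, this criterion is exposed through the tol parameter of fit shown in Section 5; the sketch below assumes, based on those examples, that tol bounds the relative duality gap and that it0 sets how often the gap is checked (the synthetic data is of course an assumption):

```python
import numpy as np
import cyanure as ars

rng = np.random.RandomState(0)
X = rng.randn(2000, 100)
y = np.where(rng.randn(2000) > 0, 1.0, -1.0)

clf = ars.BinaryClassifier(loss='logistic', penalty='l2')
# stop once the relative duality gap falls below 1e-4, checking it every 5 epochs
clf.fit(X, y, lambd=1.0 / X.shape[0], nepochs=500, tol=1e-4, it0=5)
```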

Cyanure is easy to use. We have developed a very simple API, relatively close to scikit-learn's API (Pedregosa et al., 2011), and we also provide compatibility functions with scikit-learn so that Cyanure can be used with minimum effort.
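For instance, a hedged sketch of the scikit-learn-style classes mentioned in Section 4; the constructor argument C and the predict method are assumed to follow scikit-learn conventions, beyond what is stated in this document:

```python
import numpy as np
import cyanure as ars

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)
y = np.where(X[:, 0] + 0.1 * rng.randn(1000) > 0, 1.0, -1.0)

# scikit-learn-style class: C plays the role of the usual inverse regularization strength
clf = ars.LogisticRegression(C=1.0)
clf.fit(X, y)
print(clf.predict(X[:5]))
```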

Cyanure should not be only for Python. A Python interface is provided for the C++ code, but it should be feasible to develop an interface for any language with a C++ API, such as R or Matlab. We are planning to develop such interfaces in the future.
Besides all these nice features, Cyanure probably also has some drawbacks, which we will let you discover by yourself.
3 Where to find Cyanure and how to install it?
The webpage of the project is available at http://julien.mairal.org/cyanure/ and detailed instructions are given there. The software package can be found on GitHub and PyPI.
4 Formulations
Cyanure addresses the minimization of empirical risks, which covers a large number of classical formulations such as logistic regression, support vector machines with squared hinge loss (we do not handle the regular hinge loss at the moment), or multinomial logistic regression.
Univariate problems.
We consider a dataset of pairs of labels/features $(y_i, \mathbf{x}_i)_{i=1,\ldots,n}$, where the features $\mathbf{x}_i$ are vectors in $\mathbb{R}^p$ and the labels $y_i$ are in $\mathbb{R}$ for regression problems and in $\{-1,+1\}$ for classification. Cyanure learns a prediction function $h(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + b$ or $h(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^\top\mathbf{x} + b)$, respectively for regression and classification, where $\mathbf{w}$ in $\mathbb{R}^p$ represents the weights of a linear model and $b$ is an (optional) intercept. Learning these parameters is achieved by minimizing
$$\min_{\mathbf{w} \in \mathbb{R}^p,\, b \in \mathbb{R}} \ \frac{1}{n}\sum_{i=1}^n \ell(y_i, \mathbf{w}^\top\mathbf{x}_i + b) + \psi(\mathbf{w}),$$
where $\psi$ is a regularization function, which will be detailed later, and $\ell$ is a loss function that is chosen among the following choices:
Loss | Label | $\ell(y, \hat{y})$
square | $y \in \mathbb{R}$ | $\frac{1}{2}(y - \hat{y})^2$
logistic | $y \in \{-1,+1\}$ | $\log(1 + e^{-y\hat{y}})$
sqhinge | $y \in \{-1,+1\}$ | $\frac{1}{2}\max(0, 1 - y\hat{y})^2$
safe-logistic | $y \in \{-1,+1\}$ | $e^{y\hat{y} - 1} - y\hat{y}$ if $y\hat{y} \leq 1$ and $0$ otherwise
Table 1: Loss functions used for univariate machine learning problems.
logistic corresponds to the loss function used for logistic regression. You may be used to a different (but equivalent) formula if you use labels in $\{0,1\}$. In order to illustrate the choice of loss functions, we plot in Figure 1 their values as functions of $\hat{y}$ when $y=+1$. We note that all of them are differentiable (a requirement of our algorithms), which is why we do not handle the regular hinge loss.
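To make the equivalence explicit, with labels $y' = (y+1)/2$ in $\{0,1\}$ and $\sigma(u) = 1/(1+e^{-u})$, the usual cross-entropy formulation coincides with the logistic loss of Table 1 (a standard identity, stated here for convenience):
$$-\,y'\log\sigma(\hat{y}) - (1-y')\log\bigl(1-\sigma(\hat{y})\bigr) = \log\bigl(1 + e^{-y\hat{y}}\bigr).$$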
These problems are handled by the classes BinaryClassifier and Regression for binary classification and regression problems, respectively. If you are used to scikit-learn (Pedregosa et al., 2011), we also provide the classes LinearSVC and LogisticRegression, which are compatible with scikit-learn's API (in particular, they use the parameter C instead of $\lambda$).
The regularization function $\psi$ is chosen among the following choices:
Penalty | $\psi(\mathbf{w})$
l2 | $\frac{\lambda}{2}\|\mathbf{w}\|_2^2$
l1 | $\lambda\|\mathbf{w}\|_1$
elastic-net | $\lambda\|\mathbf{w}\|_1 + \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2$
fused-lasso | $\lambda\sum_{j=2}^{p}|\mathbf{w}[j]-\mathbf{w}[j-1]| + \lambda_2\|\mathbf{w}\|_1 + \frac{\lambda_3}{2}\|\mathbf{w}\|_2^2$
none | $\psi(\mathbf{w}) = 0$
Constraint |
l1-ball | $\|\mathbf{w}\|_1 \leq \lambda$ (constraint)
l2-ball | $\|\mathbf{w}\|_2 \leq \lambda$ (constraint)
Table 2: Regularization functions used for univariate machine learning problems.
Note that the intercept $b$ is not regularized and that the last two regularization functions encode constraints on the variable $\mathbf{w}$.
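As an illustration, a minimal sketch of a Lasso-type regression with the Regression class; the synthetic data is an assumption, and the constructor is assumed to mirror BinaryClassifier, with the loss and penalty names of Tables 1 and 2:

```python
import numpy as np
import cyanure as ars

rng = np.random.RandomState(0)
n, p = 5000, 200
X = rng.randn(n, p)
w_true = np.zeros(p)
w_true[:10] = 1.0                      # sparse ground-truth weights
y = X.dot(w_true) + 0.1 * rng.randn(n)

# square loss + l1 penalty, i.e., a Lasso problem
reg = ars.Regression(loss='square', penalty='l1')
reg.fit(X, y, lambd=0.05, nepochs=200, tol=1e-4)
```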
Multivariate problems.
We now consider a dataset of pairs of labels/features $(\mathbf{y}_i, \mathbf{x}_i)_{i=1,\ldots,n}$, where the features $\mathbf{x}_i$ are vectors in $\mathbb{R}^p$ and the labels $\mathbf{y}_i$ are in $\mathbb{R}^k$ for multivariate regression problems and in $\{1,2,\ldots,k\}$ for multiclass classification. For regression, Cyanure learns a prediction function $h(\mathbf{x}) = \mathbf{W}^\top\mathbf{x} + \mathbf{b}$, where $\mathbf{W} = [\mathbf{w}_1,\ldots,\mathbf{w}_k]$ in $\mathbb{R}^{p \times k}$ represents the weights of a linear model and $\mathbf{b}$ in $\mathbb{R}^k$ is a set of (optional) intercepts. For classification, the prediction function is simply $h(\mathbf{x}) = \arg\max_{j=1,\ldots,k} \mathbf{w}_j^\top\mathbf{x} + \mathbf{b}[j]$. Then, Cyanure optimizes the following cost function:
$$\min_{\mathbf{W} \in \mathbb{R}^{p \times k},\, \mathbf{b} \in \mathbb{R}^k} \ \frac{1}{n}\sum_{i=1}^n \ell(\mathbf{y}_i, \mathbf{W}^\top\mathbf{x}_i + \mathbf{b}) + \psi(\mathbf{W}).$$
The loss function $\ell$ may be chosen among the following ones:
Loss | Label | $\ell(\mathbf{y}, \hat{\mathbf{y}})$
square | $\mathbf{y} \in \mathbb{R}^k$ | $\frac{1}{2}\|\mathbf{y} - \hat{\mathbf{y}}\|_2^2$
square, logistic, sqhinge, safe-logistic | $\mathbf{y} \in \{-1,+1\}^k$ | $\sum_{j=1}^k \tilde{\ell}(\mathbf{y}[j], \hat{\mathbf{y}}[j])$, with $\tilde{\ell}$ chosen from Table 1
multiclass-logistic | $y \in \{1,2,\ldots,k\}$ | $\log\left(\sum_{j=1}^k e^{\hat{\mathbf{y}}[j] - \hat{\mathbf{y}}[y]}\right)$
The regularization functions may be chosen according to the following table.
Penalty or constraint | $\psi(\mathbf{W})$
l2, l1, elastic-net, fused-lasso, none, l1-ball, l2-ball | $\sum_{j=1}^k \tilde{\psi}(\mathbf{w}_j)$, with $\tilde{\psi}$ chosen from Table 2
l1l2 | $\lambda\sum_{j=1}^p \|\mathbf{W}^j\|_2$ ($\mathbf{W}^j$ is the $j$-th row of $\mathbf{W}$)
l1linf | $\lambda\sum_{j=1}^p \|\mathbf{W}^j\|_\infty$
In Cyanure, the classes MultiVariateRegression and MultiClassifier are devoted to multivariate machine learning problems.
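For instance, a sketch of multivariate regression with the l1l2 (multi-task group Lasso) penalty; the MultiVariateRegression constructor is assumed to mirror the MultiClassifier usage shown in Section 5, and the data is synthetic:

```python
import numpy as np
import cyanure as ars

rng = np.random.RandomState(0)
n, p, k = 3000, 100, 5
X = rng.randn(n, p)
W_true = np.zeros((p, k))
W_true[:10, :] = rng.randn(10, k)          # the same 10 rows are active for all tasks
Y = X.dot(W_true) + 0.1 * rng.randn(n, k)

# square loss + l1l2 penalty encourages row-wise sparsity of W
reg = ars.MultiVariateRegression(loss='square', penalty='l1l2')
reg.fit(X, Y, lambd=0.01, nepochs=200, tol=1e-4)
```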
Algorithms.
Cyanure implements the following solvers (a brief usage sketch follows the list):

ista: the basic proximal gradient descent method with line-search (see Beck and Teboulle, 2009);

fista: its accelerated variant (Beck and Teboulle, 2009);

qning-ista: the Quasi-Newton variant of ISTA proposed by Lin et al. (2019); this is an effective variant of L-BFGS, which can handle composite optimization problems.

miso: the MISO algorithm of Mairal (2015), which may be seen as a primal variant of SDCA (Shalev-Shwartz and Zhang, 2012);

catalyst-miso: its accelerated variant, obtained by using the Catalyst approach of Lin et al. (2018);

qning-miso: its Quasi-Newton variant, obtained by using the QNing approach of Lin et al. (2019);

svrg: the proximal SVRG algorithm of Xiao and Zhang (2014);

catalyst-svrg: its accelerated variant obtained by using Catalyst (Lin et al., 2018);

qning-svrg: its Quasi-Newton variant obtained by using QNing (Lin et al., 2019);

acc-svrg: a variant of SVRG with direct acceleration, introduced in (Kulunchakov and Mairal, 2019);

auto: this default choice uses qning-ista for small problems, or catalyst-miso, or qning-miso, depending on the conditioning of the problem.
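Based on the fit calls shown in Section 5, we assume the solver can be selected through a keyword argument of fit; a minimal sketch, where the keyword name solver and the synthetic data are assumptions:

```python
import numpy as np
import cyanure as ars

rng = np.random.RandomState(0)
X = rng.randn(10000, 500)
y = np.where(rng.randn(10000) > 0, 1.0, -1.0)

clf = ars.BinaryClassifier(loss='logistic', penalty='l2')
# override the default 'auto' choice with an explicit solver
clf.fit(X, y, lambd=1.0 / X.shape[0], nepochs=300, tol=1e-3, solver='qning-miso')
```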
5 How to use it?
The online documentation provides information about the four main classes BinaryClassifier, Regression, MultiVariateRegression, and MultiClassifier, and how to use them. Below, we provide simple examples.
Examples for binary classification.
The following code performs binary classification with $\ell_2$-regularized logistic regression, with no intercept, on the criteo dataset (21Gb, huge sparse matrix):

```python
import numpy as np
import scipy.sparse
import cyanure as ars
# load criteo dataset 21Gb, n=45840617, p=999999
dataY = np.load('criteo_y.npz', allow_pickle=True); y = dataY['y']
X = scipy.sparse.load_npz('criteo_X.npz')
# normalize the rows of X in place, without performing any copy
ars.preprocess(X, normalize=True, columns=False)
# declare a binary classifier for l2-logistic regression
classifier = ars.BinaryClassifier(loss='logistic', penalty='l2')
# uses the auto solver by default, performs at most 500 epochs
classifier.fit(X, y, lambd=0.1/X.shape[0], nepochs=500, tol=1e-3, it0=5)
```
Before we comment on the previous choices, let us run the above code on a regular three-year-old quad-core workstation with 32Gb of memory (Intel Xeon CPU E5-1630 v4, $400 retail price).
```
Matrix X, n=45840617, p=999999
*********************************
Catalyst Accelerator
MISO Solver
Incremental Solver with uniform sampling
Lipschitz constant: 0.250004
Logistic Loss is used
L2 regularization
Epoch: 5, primal objective: 0.456014, time: 92.5784
Best relative duality gap: 14383.9
Epoch: 10, primal objective: 0.450885, time: 227.593
Best relative duality gap: 1004.69
Epoch: 15, primal objective: 0.450728, time: 367.939
Best relative duality gap: 6.50049
Epoch: 20, primal objective: 0.450724, time: 502.954
Best relative duality gap: 0.068658
Epoch: 25, primal objective: 0.450724, time: 643.323
Best relative duality gap: 0.00173208
Epoch: 30, primal objective: 0.450724, time: 778.363
Best relative duality gap: 0.00173207
Epoch: 35, primal objective: 0.450724, time: 909.426
Best relative duality gap: 9.36947e-05
Time elapsed : 928.114
```
The solver used was catalyst-miso; the problem was solved up to the requested tolerance of $10^{-3}$ in about 15mn after 35 epochs (without taking into account the time to load the dataset from the hard drive). The regularization parameter was chosen to be $\lambda = 0.1/n$, which is close to the optimal one given by cross-validation. Even though performing a grid search with cross-validation would be more costly, this shows that processing such a large dataset does not necessarily require a massive investment in Amazon EC2 credits, GPUs, or distributed computing architectures.
In the next example, we use the squared hinge loss with $\ell_1$-regularization, choosing the regularization parameter such that the obtained solution is sparse (only a small fraction of nonzero coefficients). We also fit an intercept. As shown below, the solution is obtained in about 26s on a laptop with a quad-core i7-8565U CPU.

```python
import numpy as np
import scipy.sparse
import cyanure as ars
# load rcv1 dataset about 1Gb, n=781265, p=47152
data = np.load('rcv1.npz', allow_pickle=True); y = data['y']; X = data['X']
X = scipy.sparse.csc_matrix(X.all()).T  # n x p matrix, csr format
# normalize the rows of X in place, without performing any copy
ars.preprocess(X, normalize=True, columns=False)
# declare a binary classifier for squared hinge loss + l1 regularization
classifier = ars.BinaryClassifier(loss='sqhinge', penalty='l1')
# uses the auto solver by default, performs at most 500 epochs
classifier.fit(X, y, lambd=0.000005, nepochs=500, tol=1e-3)
```

which yields
```
Matrix X, n=781265, p=47152
Memory parameter: 20
*********************************
QNing Accelerator
MISO Solver
Incremental Solver with uniform sampling
Lipschitz constant: 1
Squared Hinge Loss is used
L1 regularization
Epoch: 10, primal objective: 0.0915524, time: 7.33038
Best relative duality gap: 0.387338
Epoch: 20, primal objective: 0.0915441, time: 15.524
Best relative duality gap: 0.00426003
Epoch: 30, primal objective: 0.0915441, time: 25.738
Best relative duality gap: 0.000312145
Time elapsed : 26.0225
Total additional line search steps: 8
Total skipping lbfgs steps: 0
```
Multiclass classification.
Let us now do something a bit more involved and perform multinomial logistic regression on the ckn_mnist dataset (10 classes, n=60 000, p=2 304, dense matrix), with multi-task group lasso regularization, using the same laptop as previously, and choosing a regularization parameter that yields a sparse solution.

```python
import numpy as np
import cyanure as ars
# load ckn_mnist dataset 10 classes, n=60000, p=2304
data = np.load('ckn_mnist.npz'); y = data['y']; X = data['X']
# center and normalize the rows of X in place, without performing any copy
ars.preprocess(X, centering=True, normalize=True, columns=False)
# declare a multinomial logistic classifier with group Lasso regularization
classifier = ars.MultiClassifier(loss='multiclass-logistic', penalty='l1l2')
# uses the auto solver by default, performs at most 500 epochs
classifier.fit(X, y, lambd=0.0001, nepochs=500, tol=1e-3, it0=5)
```
```
Matrix X, n=60000, p=2304
Memory parameter: 20
*********************************
QNing Accelerator
MISO Solver
Incremental Solver with uniform sampling
Lipschitz constant: 0.25
Multiclass logistic Loss is used
Mixed L1-L2 norm regularization
Epoch: 5, primal objective: 0.340267, time: 30.2643
Best relative duality gap: 0.332051
Epoch: 10, primal objective: 0.337646, time: 62.0562
Best relative duality gap: 0.0695877
Epoch: 15, primal objective: 0.337337, time: 93.9541
Best relative duality gap: 0.0172626
Epoch: 20, primal objective: 0.337293, time: 125.683
Best relative duality gap: 0.0106066
Epoch: 25, primal objective: 0.337285, time: 170.044
Best relative duality gap: 0.00409663
Epoch: 30, primal objective: 0.337284, time: 214.419
Best relative duality gap: 0.000677961
Time elapsed : 215.074
Total additional line search steps: 4
Total skipping lbfgs steps: 0
```
Learning the multiclass classifier took about 3mn and 35s. To conclude, we provide a last, more classical example of learning $\ell_2$-logistic regression classifiers on the same dataset, in a one-vs-all fashion.

```python
import numpy as np
import cyanure as ars
# load ckn_mnist dataset 10 classes, n=60000, p=2304
data = np.load('ckn_mnist.npz'); y = data['y']; X = data['X']
# center and normalize the rows of X in place, without performing any copy
ars.preprocess(X, centering=True, normalize=True, columns=False)
# declare a multiclass classifier for one-vs-all l2-logistic regression
classifier = ars.MultiClassifier(loss='logistic', penalty='l2')
# uses the auto solver by default, performs at most 500 epochs
classifier.fit(X, y, lambd=0.01/X.shape[0], nepochs=500, tol=1e-3)
```

Then, the classifiers are learned in parallel using the four CPU cores (still on the same laptop), which gives the following output after about 1mn:
```
Matrix X, n=60000, p=2304
Solver 4 has terminated after 30 epochs in 36.3953 seconds
   Primal objective: 0.00877348, relative duality gap: 8.54385e-05
Solver 8 has terminated after 30 epochs in 37.5156 seconds
   Primal objective: 0.0150244, relative duality gap: 0.000311491
Solver 9 has terminated after 30 epochs in 38.4993 seconds
   Primal objective: 0.0161167, relative duality gap: 0.000290268
Solver 7 has terminated after 30 epochs in 39.5971 seconds
   Primal objective: 0.0105672, relative duality gap: 6.49337e-05
Solver 0 has terminated after 40 epochs in 45.1612 seconds
   Primal objective: 0.00577768, relative duality gap: 3.6291e-05
Solver 6 has terminated after 40 epochs in 45.8909 seconds
   Primal objective: 0.00687928, relative duality gap: 0.000175357
Solver 2 has terminated after 40 epochs in 45.9899 seconds
   Primal objective: 0.0104324, relative duality gap: 1.63646e-06
Solver 5 has terminated after 40 epochs in 47.1608 seconds
   Primal objective: 0.00900643, relative duality gap: 3.42144e-05
Solver 3 has terminated after 30 epochs in 12.8874 seconds
   Primal objective: 0.00804966, relative duality gap: 0.000200631
Solver 1 has terminated after 40 epochs in 15.8949 seconds
   Primal objective: 0.00487406, relative duality gap: 0.000584138
Time for the one-vs-all strategy
Time elapsed : 62.9996
```
Note that the toolbox also provides the classes LinearSVC and LogisticRegression, which are near-compatible with scikit-learn's API. More information is available in the online documentation.
6 Benchmarks
We consider the problem of logistic regression for binary classification, or multinomial logistic regression if multiple classes are present. We present the results obtained by the solvers of Cyanure on 11 datasets, listed in Table 5.
Dataset | Sparse | # classes | n | p | Size (in Gb)
covtype | No | 1 | 581 012 | 54 | 0.25
alpha | No | 1 | 500 000 | 500 | 2
real-sim | No | 1 | 72 309 | 20 958 | 0.044
epsilon | No | 1 | 250 000 | 2 000 | 4
ocr | No | 1 | 2 500 000 | 1 155 | 23.1
rcv1 | Yes | 1 | 781 265 | 47 152 | 0.95
webspam | Yes | 1 | 250 000 | 16 609 143 | 14.95
kddb | Yes | 1 | 19 264 097 | 28 875 157 | 6.9
criteo | Yes | 1 | 45 840 617 | 999 999 | 21
ckn_mnist | No | 10 | 60 000 | 2 304 | 0.55
ckn_svhn | No | 10 | 604 388 | 18 432 | 89
Table 5: Datasets considered in the benchmarks.
Experimental setup
To select a reasonable regularization parameter $\lambda$ for each dataset, we first split each dataset into 80% training and 20% validation, and select the optimal parameter from a logarithmic grid by evaluating the trained model on the validation set (a sketch of this protocol is given below). Then, we keep the optimal parameter $\lambda$, merge training and validation sets, and report the objective function values in terms of CPU time for various solvers. The CPU time is reported when running the different methods on an Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz with 128Gb of memory (in order to be able to handle the ckn_svhn dataset), limiting the maximum number of cores to 8. Note that most solvers of Cyanure are sequential algorithms that do not exploit multi-core capabilities. Those are nevertheless exploited by the Intel MKL library that we use for dense matrices. Gains with multiple cores are mostly noticeable for the methods ista, fista, and qning-ista, which are able to exploit BLAS3 (matrix-matrix multiplication) instructions. Experiments were conducted on Linux using the Anaconda Python 3.7 distribution.
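The selection protocol can be sketched as follows; the data, the exact grid of candidate parameters, and the scikit-learn-style predict method are assumptions for illustration only:

```python
import numpy as np
import cyanure as ars

rng = np.random.RandomState(0)
X = rng.randn(20000, 100)
y = np.where(X[:, 0] + 0.5 * rng.randn(20000) > 0, 1.0, -1.0)

# 80% training / 20% validation split
perm = rng.permutation(X.shape[0])
ntr = int(0.8 * X.shape[0])
Xtr, ytr = X[perm[:ntr]], y[perm[:ntr]]
Xval, yval = X[perm[ntr:]], y[perm[ntr:]]

best_acc, best_lambd = -1.0, None
for i in range(1, 17):                       # logarithmic grid of candidate parameters
    lambd = 2.0 ** (-i) / ntr
    clf = ars.BinaryClassifier(loss='logistic', penalty='l2')
    clf.fit(Xtr, ytr, lambd=lambd, nepochs=100, tol=1e-3)
    acc = np.mean(clf.predict(Xval) == yval)  # validation accuracy
    if acc > best_acc:
        best_acc, best_lambd = acc, lambd
print('best lambda:', best_lambd, 'validation accuracy:', best_acc)
```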
In the evaluation, we include solvers that can be called from scikit-learn (Pedregosa et al., 2011), such as Liblinear (Fan et al., 2008), L-BFGS (Nocedal, 1980), newton-cg, or the SAGA (Defazio et al., 2014) implementation of scikit-learn. We run each solver with different tolerance parameters in order to obtain several points illustrating their accuracy-speed trade-off. Each method is run for at most 500 epochs.
There are 11 datasets, and we group them into categories leading to similar conclusions. We start with five datasets that require a small regularization parameter, which leads to more difficult optimization problems since there is less strong convexity. This first group of results is presented in Figure 7, leading to the following conclusions:

qning and catalyst accelerations are very useful. Note that catalyst works well in practice both for svrg and miso (regular miso, not shown on the plots, is an order of magnitude slower than its accelerated variants).

qning-miso and catalyst-miso are the best solvers here, better than the svrg variants. The main reason is that for $t$ iterations, svrg computes $3t$ gradients, versus only $t$ for the miso algorithms. miso also handles sparse matrices better (there is no need to code lazy update strategies, which can be painful to implement).

Cyanure does much better than sklearn-saga, liblinear, and lbfgs, sometimes by several orders of magnitude. Note that sklearn-saga performs as poorly as our regular svrg solver on these datasets, which confirms that the key to obtaining faster results is acceleration.

Direct acceleration (acc-svrg) works a bit better than catalyst-svrg: in fact, acc-svrg is close to qning-svrg here.
Then, we present results on the six other datasets (the "easy ones" in terms of optimization) in Figure 14. For these datasets, the optimal regularization parameter is relatively large, a regime where acceleration does not bring benefits in theory. The results below are consistent with theory and we can draw the following conclusions:

acceleration is useless here, as predicted by theory, which is why the 'auto' solver only uses acceleration when needed.

qning-miso and catalyst-miso are still among the best solvers here, but the difference with svrg is smaller. sklearn-saga is sometimes competitive, sometimes not.
Acknowledgments
This work was supported by the ERC grant SOLARIS (number 714381) and by ANR 3IA MIAI@Grenoble Alpes.
References
Beck and Teboulle [2009] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Defazio et al. [2014] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

Fan et al. [2008] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874, 2008.

Kulunchakov and Mairal [2019] A. Kulunchakov and J. Mairal. Estimate sequences for stochastic composite optimization: Variance reduction, acceleration, and robustness to noise. arXiv preprint arXiv:1901.08788, 2019.

Lin et al. [2018] H. Lin, J. Mairal, and Z. Harchaoui. Catalyst acceleration for first-order convex optimization: from theory to practice. Journal of Machine Learning Research (JMLR), 18(212):1–54, 2018.

Lin et al. [2019] H. Lin, J. Mairal, and Z. Harchaoui. An inexact variable metric proximal point algorithm for generic quasi-Newton acceleration. SIAM Journal on Optimization, 29(2):1408–1443, 2019.

Mairal [2010] J. Mairal. Sparse coding for machine learning, image processing and computer vision. PhD thesis, Cachan, Ecole normale supérieure, 2010.

Mairal [2015] J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.

Mairal [2016] J. Mairal. End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems (NIPS), 2016.

Nocedal [1980] J. Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782, 1980.

Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research (JMLR), 12(Oct):2825–2830, 2011.

Shalev-Shwartz and Zhang [2012] S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717, 2012.

Xiao and Zhang [2014] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.