1 Introduction
Classification tasks on data sets with large feature dimensions are very common in real-world machine learning applications. Typical examples include microarray data for gene selection and text documents for natural language processing. Despite the large number of features present in such data sets, usually only a small subset of the features is relevant to the particular learning task, and local correlation among the features is often observed. Hence, feature selection is required for good model interpretability. Popular classification techniques, such as the support vector machine (SVM) and logistic regression, are formulated as convex optimization problems. An extensive literature has been devoted to optimization algorithms that solve variants of these classification models with sparsity regularization
[12, 17]. Many of them are based on first-order (gradient-based) methods, mainly because the size of the optimization problem is very large. The advantage of first-order methods is that their computational and memory requirements at each iteration are low; as a result, they can handle the large optimization problems occurring in classification. Their major disadvantage is their slow convergence, especially once a good approximation of the feature support has been identified. Second-order methods exhibit fast local convergence, but their computational and memory requirements are much more demanding, since they need to store and invert the Newton matrix at every iteration. It is therefore very important to be able to intelligently combine the advantages of both first- and second-order optimization methods in such a way that the resulting algorithm can solve large classification problems efficiently and accurately. As we will demonstrate in this paper, such a combination is possible by taking advantage of the problem structure and the change in its size during the solution process. In addition, we will also show that our algorithmic framework is flexible enough to incorporate prior knowledge to improve classification performance.
1.1 Related Work
The above requirements demand three features from a learning algorithm: 1. it should be able to automatically select features, which are possibly grouped and highly correlated; 2. it has to solve the optimization problem in the training phase efficiently and with high accuracy; and 3. the learning model needs to be flexible enough so that domain knowledge can be easily incorporated. Existing methods in the literature meet some of the above requirements individually. For enforcing sparsity in the solution, efficient optimization algorithms such as that proposed in [9] can solve large-scale sparse logistic regression. On the other hand, ℓ1 regularization is unstable in the presence of highly correlated features: among a group of such features, essentially one of them is selected in a random manner. To handle local correlation among groups of features, the elastic-net regularization [22] has been successfully applied to SVM [20] and logistic regression [15]. However, incorporating domain knowledge into the logistic regression formulation is not straightforward. For SVM, including such knowledge in the optimization process has been demonstrated in [5]. Recently, an Alternating Direction Method of Multipliers (ADMM) has been proposed for the elastic-net SVM (ENSVM) [21]. ADMM is quick to find an approximate solution to the ENSVM problem, but it is known to converge very slowly to high-accuracy optimal solutions [2]. Interior-point methods (IPMs) for SVM are known to achieve high accuracy in their solutions with a polynomial iteration complexity, and the dual SVM formulation is independent of the feature space dimensionality. However, the classical ℓ2-norm SVM is not able to perform automatic feature selection. Although the elastic-net SVM can be formulated as a QP (in the primal form), its problem size grows substantially with the feature dimensionality.
Due to the need to solve a Newton system in each iteration, the efficiency of IPMs quickly deteriorates as the feature dimension becomes large.
1.2 Main Contributions
In this paper we propose a new hybrid algorithmic framework for SVM to address all of the above challenges and requirements simultaneously. Our framework combines the advantages of a first-order optimization algorithm (through the use of ADMM) and a second-order method (via IPM) to achieve both superior speed and accuracy. Through a novel algorithmic approach that is able to incorporate expert knowledge, our proposed framework is able to exploit domain knowledge to improve feature selection, and hence, prediction accuracy. Besides efficiency and generalization performance, we demonstrate through experiments on both synthetic and real data that our method is also more robust to inaccuracy in the supplied knowledge than existing approaches.
2 A Twophase Hybrid Optimization Algorithm
As previously mentioned, for data sets with many features, the high dimensionality of the feature space still poses a computational challenge for IPM. Fortunately, many data sets of this kind are very sparse, and the resulting classifier
is also expected to be sparse, i.e. only a small subset of the features is expected to carry significant weights in classification. Naturally, it is ideal for IPM to train a classifier on the most important features only. Inspired by the Hybrid Iterative Shrinkage (HIS) [16] algorithm for training large-scale sparse logistic regression classifiers, we propose a two-phase algorithm to shrink the feature space appropriately so as to leverage the high accuracy of IPM while maintaining efficiency. Specifically, we propose to solve an elastic-net SVM (ENSVM) or doubly regularized SVM (DrSVM) [20] problem during the first phase of the algorithm. The elastic-net regularization performs feature selection with a grouping effect and has been shown to be effective on data sets with many but sparse features and high local correlations [22]. This is the case for text classification, microarray gene expression, and fMRI data sets. The support of the weight vector for ENSVM usually stabilizes well before the algorithm converges to the optimal solution. Taking advantage of this behavior, we can terminate the first phase of the hybrid algorithm early and proceed to solve a classical SVM problem on the reduced feature set in the second phase, using an IPM solver.
2.1 Solving the Elastic Net SVM using ADMM
SVM can be written in the regularized regression form as
(1)    min_{w,b}  (1/n) Σ_{i=1}^n [1 − y_i(wᵀx_i + b)]₊  +  (λ₂/2) ‖w‖₂²
where the first term is an averaged sum of the hinge losses and the second term is viewed as a ridge regularization on w. It is easy to see from this form that the classical SVM does not enforce sparsity in the solution, and w is generally dense. The ENSVM adds an ℓ1 regularization on top of the ridge regularization term, giving
(2)    min_{w,b}  (1/n) Σ_{i=1}^n [1 − y_i(wᵀx_i + b)]₊  +  λ₁ ‖w‖₁  +  (λ₂/2) ‖w‖₂²
Compared to the Lasso (ℓ1-regularized regression) [18], the elastic net has the advantage of selecting highly correlated features in groups (the grouping effect) while still enforcing sparsity in the solution. This is a particularly attractive property for text document data, which is common in the hierarchical classification setting. Adopting the elastic-net regularization as in (2) brings the same benefit to SVM for training classifiers.
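To make the objective in (2) concrete, here is a minimal NumPy sketch that evaluates it; the names lam1 and lam2 for λ₁ and λ₂ are ours, not the paper's.

```python
import numpy as np

def ensvm_objective(w, b, X, y, lam1, lam2):
    """Elastic-net SVM objective: averaged hinge loss plus
    l1 and ridge penalties on the weight vector w."""
    margins = y * (X @ w + b)                    # y_i (w^T x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins).mean()
    return hinge + lam1 * np.abs(w).sum() + 0.5 * lam2 * (w @ w)

# sanity check: a separating w with margin >= 1 incurs zero hinge loss,
# so only the two regularization terms remain: 0.1*1 + 0.5*1 = 0.6
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 0.0])
print(ensvm_objective(w, 0.0, X, y, lam1=0.1, lam2=1.0))
```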
To approximately solve problem (2), we adopt the alternating direction method of multipliers (ADMM) for the elastic-net SVM recently proposed in [21]. ADMM has a long history dating back to the 1970s [6]. Recently, it has been successfully applied to problems in machine learning [2]. ADMM is a special case of the inexact augmented Lagrangian (IAL) method [14] for the structured unconstrained problem
(3)    min_x  f(x) + g(x),
where both functions f and g are convex. We can decouple the two functions by introducing an auxiliary variable z and convert problem (3) into an equivalent constrained optimization problem
(4)    min_{x,z}  f(x) + g(z)    s.t.  x = z.
This technique is often called variable-splitting [3]. The IAL method approximately minimizes in each iteration the augmented Lagrangian of (4), defined by L_μ(x, z; γ) = f(x) + g(z) + γᵀ(x − z) + (μ/2)‖x − z‖₂², followed by an update to the Lagrange multiplier γ. The IAL method is guaranteed to converge to the optimal solution of (3), as long as the subproblem of approximately minimizing the augmented Lagrangian is solved with an increasing accuracy [14]. ADMM can be viewed as a practical implementation of IAL, where the subproblem is solved approximately by minimizing with respect to x and z alternately, once each. Eckstein and Bertsekas [4] established the convergence of ADMM for the case of two-way splitting. Applying variable-splitting and ADMM to problem (2), [21] introduced auxiliary variables and linear constraints so that the nonsmooth hinge loss and ℓ1 norm in the objective function are decoupled, making it easy to optimize over each of the variables. Specifically, problem (2) is transformed into an equivalent constrained form
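The ADMM recipe just described can be sketched in a few lines. The toy below applies the same f(x) + g(z), x = z split to an ℓ1-regularized least-squares problem rather than the ENSVM itself, since its closed-form updates are one-liners; the penalty parameter mu and the iteration count are illustrative choices.

```python
import numpy as np

def admm_lasso(A, b, lam, mu=1.0, iters=200):
    """ADMM for min_x 0.5||Ax - b||^2 + lam*||x||_1 via the split
    min f(x) + g(z) s.t. x = z, with f smooth and g the l1 norm."""
    n = A.shape[1]
    x = np.zeros(n); z = np.zeros(n); gamma = np.zeros(n)
    Atb = A.T @ b
    M = A.T @ A + mu * np.eye(n)          # x-update system matrix
    for _ in range(iters):
        # x-update: minimize the augmented Lagrangian over x
        x = np.linalg.solve(M, Atb + mu * z - gamma)
        # z-update: soft-thresholding (prox of the l1 norm)
        v = x + gamma / mu
        z = np.sign(v) * np.maximum(np.abs(v) - lam / mu, 0.0)
        # multiplier update
        gamma = gamma + mu * (x - z)
    return z

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [3.0, -2.0, 1.5]
b = A @ x_true
x_hat = admm_lasso(A, b, lam=0.1)
print(np.count_nonzero(np.abs(x_hat) > 1e-3))  # a sparse estimate
```

The z-update is exactly the shrinkage operator that reappears in the c-update of ADMM-ENSVM below.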
(5)    min_{w,b,a,c}  (1/n) Σ_{i=1}^n [a_i]₊ + λ₁‖c‖₁ + (λ₂/2)‖w‖₂²
       s.t.  a = 1 − Y(Xw + b·1),    c = w,
where x_iᵀ is the i-th row of X, and Y = diag(y₁, …, y_n). The augmented Lagrangian
(6)    L(w, b, a, c; γ₁, γ₂) = (1/n) Σ_{i=1}^n [a_i]₊ + λ₁‖c‖₁ + (λ₂/2)‖w‖₂² + γ₁ᵀ(a − 1 + Y(Xw + b·1)) + (μ₁/2)‖a − 1 + Y(Xw + b·1)‖₂² + γ₂ᵀ(c − w) + (μ₂/2)‖c − w‖₂²
is then minimized with respect to (w, b), a, and c sequentially in each iteration, followed by an update to the Lagrange multipliers γ₁ and γ₂. The original problem is thus decomposed into three subproblems consisting of computing the proximal operator of the hinge loss function (with respect to a), solving a special linear system (with respect to (w, b)), and performing a soft-thresholding operation (with respect to c), all of which can be done in an efficient manner. Due to lack of space, we have included the detailed solution steps in the Appendix (see Algorithm 0.A.1 ADMM-ENSVM), where we denote by prox the proximal operator associated with the hinge loss and by S the shrinkage (soft-thresholding) operator.
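Both elementary operators used by ADMM-ENSVM have simple closed forms; the sketch below is our own minimal implementation, with illustrative scaling conventions for the step size t and threshold kappa.

```python
import numpy as np

def prox_hinge(v, t):
    """Proximal operator of the hinge loss h(a) = max(0, 1 - a):
    argmin_a h(a) + (1/(2t)) (a - v)^2, applied elementwise.
    Piecewise solution: v if v >= 1; v + t if v <= 1 - t; else 1."""
    return np.where(v >= 1.0, v, np.where(v <= 1.0 - t, v + t, 1.0))

def shrink(v, kappa):
    """Soft-thresholding (shrinkage) operator:
    argmin_c kappa*|c| + 0.5*(c - v)^2, applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

print(prox_hinge(np.array([2.0, 0.0, 0.9]), 0.5))  # values 2.0, 0.5, 1.0
print(shrink(np.array([1.5, -0.2]), 0.5))          # values 1.0, 0.0
```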
2.2 SVM via InteriorPoint Method
Interior-point methods enjoy fast convergence rates for a wide class of QP problems. Their theoretical polynomial iteration complexity was first established by Mizuno [13]. In addition, Andersen et al. [1] showed that the number of iterations needed by IPMs to converge grows very slowly with the problem size, which demonstrates that their computational effort increases at a slower rate than the size of the problem.
Both the primal and the dual SVM are QP problems. The primal formulation of SVM [19] is defined as
(SVMP)    min_{w,b,ξ}  (1/2)‖w‖₂² + C Σ_{i=1}^n ξ_i    s.t.  y_i(wᵀx_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, n,
whereas the dual SVM has the form
(SVMD)    max_α  Σ_{i=1}^n α_i − (1/2) αᵀQα    s.t.  yᵀα = 0,  0 ≤ α_i ≤ C,  i = 1, …, n,
where Q_ij = y_i y_j x_iᵀx_j. By considering the KKT conditions of (SVMP) and (SVMD), the optimal weight vector is given by w* = Σ_{i∈S} α_i y_i x_i, where S is the set of sample indices corresponding to the support vectors. The optimal bias term b* can be computed from the complementary slackness conditions.
Whether to solve (SVMP) or (SVMD) for a given data set depends on its dimensions as well as its sparsity. Even if the data matrix X is sparse, Q in (SVMD) is still likely to be dense, whereas the Hessian matrix in (SVMP) is the identity. The primal problem (SVMP), however, has a larger variable dimension and more constraints. One should therefore solve (SVMP) when the number of features is smaller than the number of samples, and (SVMD) when the number of features exceeds the number of samples. Since in the second phase of Algorithm 2.1 we expect to have identified a small number of promising features, we have decided to solve (SVMD) by using IPM. Solving (SVMD) is realized through the OOQP [7] software package, which implements a primal-dual IPM for convex QP problems.
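As a small illustration of the dual route, the sketch below solves (SVMD) for a toy separable problem with an off-the-shelf solver (SciPy's SLSQP standing in for OOQP, which we do not reproduce here) and then recovers w and b from the support vectors as described above.

```python
import numpy as np
from scipy.optimize import minimize

# toy linearly separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [-2.0, -2.0], [-3.0, -1.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
C = 10.0
G = y[:, None] * X                   # rows y_i * x_i
Q = G @ G.T                          # Q_ij = y_i y_j x_i^T x_j

# dual: max sum(alpha) - 0.5 alpha^T Q alpha, s.t. y^T alpha = 0, 0 <= alpha <= C
res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
               np.zeros(len(y)), method="SLSQP",
               bounds=[(0.0, C)] * len(y),
               constraints=[{"type": "eq", "fun": lambda a: y @ a}])
alpha = res.x

# recover the primal solution from the support vectors
w = ((alpha * y)[:, None] * X).sum(axis=0)
sv = alpha > 1e-6                    # support-vector indices
b = np.mean(y[sv] - X[sv] @ w)       # complementary slackness
pred = np.sign(X @ w + b)
print(pred)                          # matches y on this separable toy
```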
2.3 The Twophase Algorithm
Let us keep in mind that the primary objective of the first phase is to appropriately reduce the feature space dimensionality without impacting the final prediction accuracy. As we mentioned above, the suitability of ADMM for the first phase depends on whether the support of the feature vector converges quickly or not. On an illustrative dataset from [21], which has 50 samples with 300 features each, ADMM converged in 558 iterations. The output classifier w contained only 13 nonzero features, and the feature support converged in approximately 50 iterations (see Figure 1 in the Appendix for illustrative plots showing the early convergence of ADMM). Although the more than 500 remaining iterations are needed by ADMM in order to satisfy the optimality criteria, they do not offer any additional information regarding the feature selection process. Hence, it is important to monitor the change in the signs and indices of the support and terminate the first phase promptly. In our implementation, we adopt the criterion used in [16] and monitor the relative change in the iterates as a surrogate for the change in the support, i.e.
(7)    ‖w^{k+1} − w^k‖₂ / max(‖w^k‖₂, 1) ≤ ε
We have observed in our experiments that when the change over the iterates is small, the evolution of the support indices stabilizes too.
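A minimal version of this surrogate test might look as follows; the tolerance value is an illustrative placeholder, not the one used in the paper.

```python
import numpy as np

def support_stabilized(w_new, w_old, tol=1e-3):
    """Surrogate test for support stabilization in the spirit of (7):
    a small relative change between consecutive iterates suggests the
    sparsity pattern has settled, so Phase 1 can hand over to the IPM."""
    rel = np.linalg.norm(w_new - w_old) / max(np.linalg.norm(w_old), 1.0)
    return rel <= tol

w_old = np.array([0.0, 1.0, 0.0, -2.0])
w_new = np.array([0.0, 1.0005, 0.0, -2.0003])
print(support_stabilized(w_new, w_old))  # True: iterates nearly unchanged
```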
Upon starting the second phase, it is desirable for IPM to warm-start from the corresponding subvector of the solution returned by ADMM. It should also be noted that we apply IPM during the second phase to solve the classical ℓ2-regularized SVM (1), instead of the ENSVM (2) solved in the first phase. There are two main reasons for this decision. First, although the ENSVM can be reformulated as a QP, the size of the problem is larger than that of the classical SVM due to the additional linear constraints introduced by the ℓ1 norm. Second, since we have already (approximately) identified the feature support in the first phase of the algorithm, enforcing sparsity in the reduced feature space becomes less critical. The two-phase algorithm is summarized in Algorithm 2.1.
3 Domain Knowledge Incorporation
Very often, we have prior domain knowledge for specific classification tasks. Domain knowledge is most helpful when the training data does not form a comprehensive representation of the underlying unknown population, resulting in poor generalization performance of SVM on the unseen data from the same population. This often arises in situations where labeled training samples are scarce, while there is an abundance of unlabeled data.
For high dimensional data, ENSVM performs feature selection along with training to produce a simpler model and to achieve better prediction accuracy. However, the quality of the feature selection depends entirely on the training data. In pathological cases, it is very likely that the feature support identified by ENSVM does not form a good representation of the population. Hence, when domain knowledge about certain features is available, we should take it into consideration during the training phase and include the relevant features in the feature support, should they be deemed important for classification.
In this section, we propose a new approach to achieve this objective. We consider domain knowledge in the form of class-membership information associated with features. We can incorporate such information (or enforce such rules) in SVM by adding equivalent linear constraints to the SVM QP problem (KSVM) [5, 10]. To be specific, we can model the above information with the linear implication
(8)    Bx ≤ d  ⟹  wᵀx + b ≥ 1,
where B ∈ ℝ^{k×m} and d ∈ ℝ^k. It is shown in [5] that, by utilizing the nonhomogeneous Farkas theorem of the alternative, (8) can be transformed into the following equivalent system of linear inequalities with a solution u:
(9)    Bᵀu + w = 0,    dᵀu − b + 1 ≤ 0,    u ≥ 0.
Similarly, for the linear implication for negative class membership we have
(10)    B′x ≤ d′  ⟹  wᵀx + b ≤ −1,
which can be represented by the set of linear constraints in v:
(11)    B′ᵀv − w = 0,    d′ᵀv + b + 1 ≤ 0,    v ≥ 0.
Hence, to incorporate the domain knowledge represented by (8) and (10) into SVM, Fung et al. [5] simply add the linear constraints (9) and (11) to (SVMP). Their formulation, however, increases both the variable dimension and the number of linear constraints by at least m, where m is the number of features in the classification problem we want to solve. This is clearly undesirable when m is large, which is the scenario that we consider in this paper.
In order to avoid the above increase in the size of the optimization problem, we choose to penalize the squared violations of the equality constraints in (9) and (11) instead of their counterparts considered in [5]. By doing so, the resulting problem is still a convex QP but with a much smaller size. Hence, we consider the following model, (KSVMP), for domain knowledge incorporation.
We are now ready to propose a novel combination of ENSVM and KSVM, and we will explain in the next section how the combined problem can be solved in our HIPAD framework. The main motivation behind this combination is to exploit domain knowledge to improve the feature selection, and hence, the generalization performance of HIPAD. To the best of our knowledge, this is the first method of this kind.
3.1 ADMM Phase
Our strategy for solving the elastic-net SVM with domain knowledge incorporation is again to apply the ADMM method. First, we combine problems (2) and (KSVMP) and write the resulting optimization problem in an equivalent unconstrained form (by penalizing the violation of the inequality constraints through hinge losses in the objective function). We then apply variable-splitting to decouple the ℓ1 norms and hinge losses and obtain the following equivalent constrained optimization problem:
(12)  
with the auxiliary variables defined accordingly. As usual, we form the augmented Lagrangian of problem (12) and minimize it with respect to each block of variables individually and in order. For the sake of readability, we do not penalize the nonnegativity constraints on u and v in the augmented Lagrangian.
Given the remaining variables, solving for (w, b) again involves solving a linear system
(13) 
where the coefficient matrix and right-hand side collect the current values of the other variables. Similar to the linear system in Algorithm 0.A.1 ADMM-ENSVM, we can compute the solution to the above linear system through a few PCG iterations, taking advantage of the fact that the left-hand-side matrix is the sum of a (scaled) identity and a low-rank matrix.
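The key point, that the system matrix is a scaled identity plus a low-rank term, means a matrix-free CG needs only products with X and Xᵀ, never the full Gram matrix. Below is a plain (unpreconditioned) CG sketch, with rho as an illustrative name for the diagonal shift.

```python
import numpy as np

def cg_low_rank(X, rho, rhs, iters=100, tol=1e-10):
    """Conjugate gradient for (rho*I + X^T X) w = rhs, forming only
    matrix-vector products X^T (X v); X is m x n with m << n, so the
    n x n Gram matrix is never built explicitly."""
    matvec = lambda v: rho * v + X.T @ (X @ v)
    w = np.zeros_like(rhs)
    r = rhs - matvec(w)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        a = rs / (p @ Ap)
        w += a * p
        r -= a * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 40))
rhs = rng.standard_normal(40)
w = cg_low_rank(X, rho=1.0, rhs=rhs)
w_dense = np.linalg.solve(np.eye(40) + X.T @ X, rhs)  # dense check
print(np.max(np.abs(w - w_dense)) < 1e-6)             # True
```

Because the matrix has only rank(X) + 1 distinct eigenvalue clusters, CG converges in a handful of iterations here, which mirrors why a few PCG iterations suffice in the algorithm.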
To minimize the augmented Lagrangian with respect to u, we need to solve a convex quadratic problem with nonnegativity constraints
(14) 
Solving problem (14) efficiently is crucial for the efficiency of the overall algorithm. We describe a novel way to do so. Introducing a slack variable s and transferring the nonnegativity constraint on u to s, we decompose the problem into two parts which are easy to solve. Specifically, we reformulate (14) as
Penalizing the linear constraint in the new augmented Lagrangian, the new subproblem with respect to (u, s) is
(15) 
Given an s, we can compute u by solving a linear system
(16) 
where the coefficient matrix involves the knowledge constraint matrix. We assume that this matrix has full row rank. This is a reasonable assumption, since otherwise there is at least one redundant domain knowledge constraint and we can simply remove it. The numbers of domain knowledge constraints are usually small, so the system (16) can be solved exactly and efficiently by Cholesky factorization.
Solving for s with u fixed is easy, observing that problem (15) is separable in the elements of s. For each element s_i, the optimal solution to the one-dimensional quadratic problem with a nonnegativity constraint on s_i is obtained by clipping the unconstrained minimizer at zero; in vector form, s is the elementwise maximum of that minimizer and 0. Similarly, we solve for v by introducing a nonnegative slack variable t and solving the linear system
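Both of these small steps are cheap in practice; the sketch below illustrates them with stand-in matrices (B1 and A are random placeholders, not the paper's actual knowledge constraints).

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Small SPD system of the kind arising in (16): the number of
# knowledge constraints k is small, so a dense Cholesky solve is cheap.
rng = np.random.default_rng(2)
B1 = rng.standard_normal((3, 8))        # full row rank, k = 3 constraints
A = B1 @ B1.T + 0.5 * np.eye(3)         # illustrative k x k SPD matrix
rhs = rng.standard_normal(3)
u = cho_solve(cho_factor(A), rhs)
print(np.allclose(A @ u, rhs))          # True

# Separable nonnegative quadratic in s: argmin_{s >= 0} ||s - z||^2
# is just an elementwise clip at zero, s_i = max(z_i, 0).
z = np.array([1.2, -0.7, 0.0])
s = np.maximum(z, 0.0)
print(s)                                # values 1.2, 0.0, 0.0
```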
(17) 
where the quantities are defined analogously to those in (16).
Now, given the other variables, the solutions for a and c are exactly the same as in Lines 4 and 5 of Algorithm 2.1, i.e.
The subproblem with respect to the slack variable associated with the positive-class knowledge constraint is
(18)  
The solution is given by a (one-dimensional) proximal operator associated with the hinge loss
(19) 
Similarly, the subproblem with respect to the slack variable associated with the negative-class knowledge constraint is
and the solution is given by
(20) 
Due to lack of space, we summarize the detailed solution steps in the Appendix (see Algorithm 0.A.2 ADMM-ENK).
Although there appear to be ten additional parameters in the ADMM method for EN-KSVM (six penalty parameters and four associated step lengths), we can usually set the penalty parameters to a common value and do the same for the step lengths. Hence, in practice, there is only one additional parameter to tune, and our computational experience in Section 4.2 is that the algorithm is fairly insensitive to these parameters.
3.2 IPM Phase
The second phase for solving the knowledge-based SVM problem defined by (KSVMP) follows the same steps as those described in Section 2.2. Note that in the knowledge-based case we have decided to solve the primal problem. This decision was based on extensive numerical experiments with both the primal and dual formulations, which revealed that the primal formulation is more efficient.
We found in our experiments that, by introducing slack variables and transforming the above problem into an equality-constrained QP, Phase 2 of HIPAD usually requires less time.
3.3 HIPAD with domain knowledge incorporation
We formally state the new two-phase algorithm for the elastic-net KSVM in Algorithm 3.1.
4 Numerical results
We present our numerical experience with the two main algorithms proposed in this paper: HIPAD and its knowledge-based version HIPAD-ENK. We compare their performance with their non-hybrid counterparts, i.e., ADMM-ENSVM and ADMM-ENK, which use ADMM to solve the original SVM problem. The transition condition at the end of Phase 1 is specified in (7). ADMM is terminated when the relative changes in the iterates and the constraint violations fall below fixed tolerances.
4.1 HIPAD vs ADMM
To demonstrate the practical effectiveness of HIPAD, we tested the algorithm on nine publicly available real data sets. rcv1 [11] is a collection of manually categorized news wires from Reuters. The original multiple labels have been consolidated into two classes for binary classification. realsim contains UseNet articles from four discussion groups: simulated auto racing, simulated aviation, real autos, and real aviation. Both rcv1 and realsim have large feature dimensions but are highly sparse. The remaining seven data sets are all dense. rcv1 and realsim are subsets of the original data sets, where we randomly sampled 500 training instances and 1,000 test instances. gisette is a handwritten digit recognition problem from the NIPS 2003 Feature Selection Challenge, and we also subsampled 500 instances for training. (For testing, we used the original test set of 1,000 instances.) duke, leukemia, and coloncancer are data sets of gene expression profiles for breast cancer, leukemia, and colon cancer, respectively. fMRIa, fMRIb, and fMRIc are functional MRI (fMRI) data of brain activities when the subjects are presented with pictures and text paragraphs. The data was compiled and made available by Tom Mitchell's neuroinformatics research group (http://www.cs.cmu.edu/~tom/fmri.html). Except for the three fMRI data sets, all the other data sets and their references are available at the LIBSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets).
The parameters of HIPAD, ADMM-ENSVM, and LIBSVM were selected through cross-validation on the training data. We summarize the experimental results in Table 1. Clearly, HIPAD produced the best overall prediction performance. In order to test the significance of the difference, we used the test statistic in [8] based on Friedman's chi-squared statistic, and the results are significant. In terms of CPU time, HIPAD consistently outperformed ADMM-ENSVM by several times on dense data. The feature support sizes selected by HIPAD were also very competitive with, or even better than, the ones selected by ADMM-ENSVM. In most cases, HIPAD was able to shrink the feature space to below 10% of the original size.

Table 1: Comparison of HIPAD, ADMM-ENSVM, and LIBSVM on nine real data sets.
Data set     |          HIPAD            |        ADMM-ENSVM         | LIBSVM
             | Acc (%) | Support | CPU (s) | Acc (%) | Support | CPU (s) | Acc (%)
rcv1         |  86.9 | 2,037 | 1.18 |  86.8 | 7,002 |  1.10 | 86.1
realsim      |  94.0 | 2,334 | 0.79 |  93.9 | 2,307 |  0.31 | 93.4
gisette      |  94.7 |   498 | 8.96 |  63.1 |   493 | 45.87 | 93.4
duke         |  90   |   168 | 1.56 |  90   |   150 |  5.52 | 80
leukemia     |  85.3 |   393 | 1.70 |  82.4 |   717 |  6.35 | 82.4
coloncancer  |  84.4 |   195 | 0.45 |  84.4 |   195 |  1.34 | 84.4
fMRIa        |  90   |   157 | 0.25 |  90   |   137 |  2.17 | 60
fMRIb        |  90   |    45 | 0.23 |  90   |   680 |  0.75 | 90
fMRIc        |  90   |   321 | 0.14 |  90   |   659 |  1.58 | 90
4.2 Simulation for Knowledge Incorporation
We generated synthetic data to simulate the example presented at the beginning of Section 3 in a high dimensional feature space. Specifically, four groups of multivariate Gaussians were sampled for four disjoint blocks of feature values, with block means that differ between the positive and negative classes. All the covariance matrices have 1 on the diagonal and 0.8 everywhere else. The training samples contain only two of the four blocks, while all four blocks are present in the test samples. A random fraction (5%-10%) of the remaining entries in all the samples are generated from the standard Gaussian distribution.
The training samples are hard to separate because the values of the two blocks they contain are close to each other for the two classes. However, the two additional blocks in the test samples are well separated. Hence, if we are given information about these two blocks as general knowledge for the entire population, we can expect the resulting SVM classifier to perform much better on the test data. Since we know the mean values of the distributions from which the entries in these blocks are generated, we can supply to the KSVM linear implications relating the block sample means to positive and negative class membership, which translate directly into the notation of (KSVMP).
The information given here is not precise, in that we are confident that a sample should belong to the positive (or negative) class only when the corresponding block sample mean well exceeds the distribution mean. This is consistent with realworld situations, where the domain or expert knowledge tends to be conservative and often does not come in exact form.
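A generation routine in the spirit of this description might look as follows; the block means, noise fraction, and class balance here are illustrative guesses rather than the paper's exact settings.

```python
import numpy as np

def make_block_data(n, m, block_len=50, mu=1.0, rho=0.8, seed=0):
    """Synthetic design sketch: four disjoint feature blocks drawn from
    correlated Gaussians whose means depend on the class label, plus
    sparse standard-Gaussian background noise on the remaining features."""
    rng = np.random.default_rng(seed)
    # compound-symmetric covariance: 1 on the diagonal, rho elsewhere
    cov = np.full((block_len, block_len), rho) + (1.0 - rho) * np.eye(block_len)
    X = np.zeros((n, m))
    y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
    means = np.where(y > 0, mu, -mu)         # class-dependent block mean
    for k in range(4):                       # four disjoint blocks
        sl = slice(k * block_len, (k + 1) * block_len)
        for i in range(n):
            X[i, sl] = rng.multivariate_normal(
                np.full(block_len, means[i]), cov)
    # sparse noise on a small fraction of the remaining entries
    rest = m - 4 * block_len
    mask = rng.random((n, rest)) < 0.07
    X[:, 4 * block_len:] = mask * rng.standard_normal((n, rest))
    return X, y

X, y = make_block_data(n=20, m=1000)
print(X.shape, y.shape)
```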
We simulated two sets of synthetic data for EN-KSVM as described above: ksvms10k with 10,000 features and ksvms50k with 50,000 features. The number of features in each of the four blocks is 50 for both data sets. We summarize the results in Table 2. Clearly, HIPAD-ENK is very effective in terms of speed, feature selection, and prediction accuracy on these two data sets. Even though two of the blocks are not discriminating in the training data, HIPAD-ENK was still able to identify all the 200 features in the four blocks correctly and exactly. This is precisely what we want to achieve, as we explained at the beginning of Section 3.
The expert knowledge not only helps rectify the separating hyperplane so that it generalizes better on the entire population, but also makes the training algorithm recognize the significance of the features in the two knowledge blocks.

Table 2: Comparison of HIPAD-ENK, ADMM-ENK, and LIBSVM on synthetic data.
Data set   |        HIPAD-ENK         |         ADMM-ENK          | LIBSVM

           | Acc (%) | Support | CPU (s) | Acc (%) | Support | CPU (s) | Acc (%)
ksvms10k   |  99   | 200 | 1.99 | 97.25 | 200 |  3.43 | 86.1
ksvms50k   |  98.8 | 200 | 8.37 | 96.4  | 198 | 20.89 | 74.8
5 Conclusion
We have proposed a two-phase hybrid optimization framework for solving the ENSVM, in which the first phase is solved by ADMM, followed by IPM in the second phase. In addition, we have proposed a knowledge-based extension of the ENSVM which can be solved within the same hybrid framework. Through a set of experiments, we demonstrated that our method has a significant advantage over existing methods in terms of computation time and the resulting prediction accuracy. The algorithmic framework introduced in this paper is general and potentially applicable to other regularization-based classification or regression problems.
References

[1] E. D. Andersen, J. Gondzio, C. Mészáros, and X. Xu. Implementation of interior point methods for large scale linear programming. In T. Terlaky, editor, Interior Point Methods of Mathematical Programming, pages 189–252. Kluwer Academic Publishers, 1996.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[3] P. Combettes and J. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212, 2011.
[4] J. Eckstein and D. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1):293–318, 1992.
[5] G. Fung, O. Mangasarian, and J. Shavlik. Knowledge-based support vector machine classifiers. Advances in Neural Information Processing Systems, pages 537–544, 2003.
 [6] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
 [7] E. Gertz and S. Wright. Objectoriented software for quadratic programming. ACM Transactions on Mathematical Software (TOMS), 29(1):58–81, 2003.
[8] R. L. Iman and J. M. Davenport. Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods, 9(6):571–595, 1980.
[9] K. Koh, S. Kim, and S. Boyd. An interior-point method for large-scale ℓ1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–1555, 2007.
[10] F. Lauer and G. Bloch. Incorporating prior knowledge in support vector machines for classification: A review. Neurocomputing, 71(7-9):1578–1594, 2008.
 [11] D. Lewis, Y. Yang, T. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361–397, 2004.
[12] P. M. Pardalos and P. Hansen, editors. Data Mining and Mathematical Programming. American Mathematical Society, 2008.
 [13] S. Mizuno. Polynomiality of infeasible interior point algorithms for linear programming. Mathematical Programming, 67:109–119, 1994.
[14] R. Rockafellar. The multiplier method of Hestenes and Powell applied to convex programming. Journal of Optimization Theory and Applications, 12(6):555–562, 1973.
[15] S. Ryali, K. Supekar, D. Abrams, and V. Menon. Sparse logistic regression for whole-brain classification of fMRI data. NeuroImage, 51(2):752–764, 2010.
[16] J. Shi, W. Yin, S. Osher, and P. Sajda. A fast hybrid algorithm for large-scale ℓ1-regularized logistic regression. The Journal of Machine Learning Research, 11:713–741, 2010.
[17] S. Sra, S. Nowozin, and S. J. Wright, editors. Optimization for Machine Learning. MIT Press, 2011.
 [18] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.
[19] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 2000.
[20] L. Wang, J. Zhu, and H. Zou. The doubly regularized support vector machine. Statistica Sinica, 16(2):589–615, 2006.
[21] G.-B. Ye, Y. Chen, and X. Xie. Efficient variable selection in support vector machines via the alternating direction method of multipliers. In International Conference on Artificial Intelligence and Statistics, pages 832–840, 2011.
[22] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.