1 Introduction
Recent advances in machine learning and convex optimization have led to very efficient algorithms for solving a family of regularized estimation problems. Sparsity, as a strong regularization prior, plays a central role in many inverse problems, and the use of sparsity-promoting norms as regularizers has become widespread across many disciplines of science and engineering. One added difficulty is the non-differentiability of such priors, which prevents the use of classical optimization methods such as gradient descent or Gauss-Newton methods [1, 2]. Proximal algorithms offer an efficient alternative for coping with the non-smoothness of the objective function. Furthermore, in many practical situations, simple closed-form updates of the variables of interest are possible. For an excellent review of proximal operators and algorithms see [3] and the monographs [4, 5].
1.1 Motivation
Let $A \in \mathbb{R}^{n \times m}$ be a real matrix with columns $a_i \in \mathbb{R}^n$, $i = 1, \ldots, m$. The $\ell_{\infty,1}$ mixed norm of $A$ is defined over its columns as
$$\|A\|_{\infty,1} = \sum_{i=1}^{m} \|a_i\|_\infty. \qquad (1)$$
Mixed norms such as the $\ell_{\infty,1}$ matrix norm have been used to promote block-sparse structure in the variables of interest; within the $\ell_{p,1}$ family, the larger the value of $p$, the stronger the coupling among the rows of $A$ [6]. In particular, the $\ell_{\infty,1}$ norm has been shown to be useful for estimating a set of covariate regressors in problems such as multi-task learning [7, 8, 6] and representative (exemplar) selection [9]
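To make definition (1), and the dual norm that appears later, concrete, the following minimal NumPy sketch (illustrative code, not from the paper's code release) evaluates both mixed norms of a small matrix:

```python
import numpy as np

def norm_inf_1(A):
    """l_{inf,1} mixed norm: sum over columns of the largest absolute entry."""
    return np.abs(A).max(axis=0).sum()

def norm_1_inf(A):
    """l_{1,inf} mixed norm: maximum over columns of the column l1 norm."""
    return np.abs(A).sum(axis=0).max()

A = np.array([[3.0, -1.0],
              [0.0,  2.0]])
print(norm_inf_1(A))  # 5.0  (= 3 + 2)
print(norm_1_inf(A))  # 3.0  (= max(3, 3))
```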
. A general formulation for this type of problem is to minimize some convex loss function subject to a mixed-norm constraint:
$$\min_{A} \; f(A) \quad \text{subject to} \quad \|A\|_{\infty,1} \leq \tau, \qquad (2)$$
where $\tau \geq 0$ controls the sparsity level and $f$ is some convex loss function. Note that keeping $\tau$ small encourages whole columns of $A$ to be zero. In this contribution, we are interested in efficiently solving problems of the form (2). A simple method for solving problem (2) is projected (sub)gradient descent, which computes the $k$-th iterate as
$$A^{k+1/2} = A^{k} - \alpha_k G^{k}, \quad G^{k} \in \partial f(A^{k}), \qquad (3)$$
$$A^{k+1} = P_{\tau}\big(A^{k+1/2}\big), \qquad (4)$$
where $\alpha_k$ is the step size at the $k$-th iteration of the algorithm, $G^k$ denotes a subgradient (i.e., the gradient if $f$ is differentiable) of $f$ at $A^k$, and $P_\tau$ denotes the projection onto the $\ell_{\infty,1}$ ball of radius $\tau$. The main computational challenge of such a method is the projection onto the $\ell_{\infty,1}$ mixed-norm ball, which can be computed from the proximal mapping of the dual norm, the $\ell_{1,\infty}$ mixed norm (i.e., the norm of $A$ seen as a linear operator). In this paper, we address the problem of projecting onto the $\ell_{\infty,1}$ mixed-norm ball via the computation of the proximal operator of its dual norm. This allows us to solve the class of problems in (2) that involve structured/group sparsity, namely those involving constraints on (projections onto) the $\ell_{\infty,1}$ mixed norm. The proximal mapping of the $\ell_{1,\infty}$ mixed norm is also applicable to the computation of minimax sparse pseudoinverses of underdetermined systems of linear equations [10, 11].
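The iterations (3)-(4) only require two oracles: a (sub)gradient of the loss and a projection onto the constraint set. The loop below is a generic sketch of that scheme (our own illustration; for simplicity it uses a Frobenius-ball constraint, whose projection is a rescaling, as a stand-in for the $\ell_{\infty,1}$ ball developed later in the paper):

```python
import numpy as np

def projected_subgradient(grad, project, A0, steps=200, alpha0=1.0):
    """Generic projected (sub)gradient descent, cf. iterations (3)-(4).

    grad:    returns a (sub)gradient of the loss at A
    project: projection onto the constraint set
    """
    A = project(A0)
    for k in range(1, steps + 1):
        A = project(A - (alpha0 / np.sqrt(k)) * grad(A))  # diminishing step
    return A

# Toy stand-in: minimize 0.5*||A - C||_F^2 subject to ||A||_F <= 1.
C = np.array([[3.0, 0.0],
              [0.0, 4.0]])
proj_fro = lambda A: A / max(1.0, np.linalg.norm(A))
A_hat = projected_subgradient(lambda A: A - C, proj_fro, np.zeros_like(C))
print(A_hat)  # approx C / ||C||_F = [[0.6, 0], [0, 0.8]]
```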
1.2 Prior work
Since the main computational challenge in solving problems of the form (2) lies in computing the projection onto the $\ell_{\infty,1}$ ball of a given radius, it is of practical importance to devise computationally efficient algorithms for such projections. An efficient method for computing these projections was proposed in [8]. The algorithm is based on sorting the entries of the $n \times m$ data matrix in order to find the values that satisfy the optimality conditions of the projection problem. The complexity of the method is dominated by the sorting operation and therefore has an average complexity of $O(nm\log(nm))$. An alternative strategy is to use root-search methods such as those in [12, 13] to find the optimal solution. Here we take an alternative approach and look at the proximal operator of the $\ell_{1,\infty}$ mixed matrix norm. Since the $\ell_{1,\infty}$ and $\ell_{\infty,1}$ norms are duals of each other, a simple relationship can be established between the proximal operator and the projection operator (see Section 2). Moreover, by looking at the proximal operator, better insight into the problem can be gained and exploited to accelerate the algorithms. Contrary to root-search methods, our method is exact (up to machine precision), does not require any thresholds to determine convergence, and is guaranteed to find the optimal solution in a finite number of iterations.
1.3 Contributions
In this paper we study the proximal operator of the $\ell_{1,\infty}$ mixed matrix norm and show that it can be computed using a generalization of the well-known soft-thresholding operator from the vector to the matrix case. The generalization involves applying the soft-thresholding operator to each column of the matrix, using a possibly different threshold for each column. Interestingly, all thresholds are related to each other via a quantity that depends on the given matrix. This is in sharp contrast to the vector case, where the threshold is constant and given by the regularization parameter. To compute the proximal operator efficiently, we propose a general iterative algorithm based on the optimality conditions of the proximal problem. Our method is further accelerated by the derivation of easy-to-compute lower bounds on the optimal value of the proximal problem, which effectively reduce the search space. A numerical comparison of two particular implementations of our general method with the state of the art reveals the improved computational efficiency of the proposed algorithms. We also illustrate the application of our results to biomarker discovery for the problem of cancer classification from gene expression data. The code used to generate the results presented in this paper is made publicly available by the authors.
2 Norms, projections, and proximal operators
In this section we present some background material that highlights the relationship between proximal operators, norms, and orthogonal projection operators.
Consider a nonempty closed convex set $\mathcal{C} \subseteq \mathbb{R}^n$. The orthogonal projection of a point $x \in \mathbb{R}^n$ onto $\mathcal{C}$ is given by
$$P_{\mathcal{C}}(x) = \arg\min_{y \in \mathcal{C}} \; \tfrac{1}{2}\|y - x\|_2^2, \qquad (5)$$
where we have included an irrelevant factor of $\tfrac{1}{2}$ for convenience in the exposition. Alternatively, we can also express the projection of a point as an unconstrained optimization problem:
$$P_{\mathcal{C}}(x) = \arg\min_{y} \; \tfrac{1}{2}\|y - x\|_2^2 + \mathbb{I}_{\mathcal{C}}(y), \qquad (6)$$
where we have moved the constraint into the objective by making use of the indicator function of a nonempty subset $\mathcal{C} \subseteq \mathbb{R}^n$, which is given by
$$\mathbb{I}_{\mathcal{C}}(x) = \begin{cases} 0 & x \in \mathcal{C}, \\ +\infty & \text{otherwise}. \end{cases} \qquad (7)$$
Keeping in mind the definition of the projection operator in (6) as an unconstrained optimization problem, we are now ready to introduce the definition of the proximal operator. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a lower semicontinuous convex function. Then, for every $v \in \mathbb{R}^n$ the proximal operator of $f$ is defined as
$$\operatorname{prox}_f(v) = \arg\min_{x} \; f(x) + \tfrac{1}{2}\|x - v\|_2^2. \qquad (8)$$
It is then clear that the proximal operator can be regarded as a generalization of the projection operator (e.g., replace $f$ by the indicator function of a set $\mathcal{C}$). Note that, at every point, the proximal operator is the unique solution of an unconstrained convex optimization problem. Uniqueness can be easily argued from the fact that the quadratic term in (8) makes the optimization cost strictly convex.
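A small check makes this generalization concrete: for an indicator function, the minimization in (8) reduces to the projection (6). The example below (illustrative, with a box constraint chosen purely for simplicity) shows that the prox of the indicator of $[-1,1]^n$ is just a clip:

```python
import numpy as np

def prox_indicator_box(v):
    """Prox of the indicator of C = [-1, 1]^n: minimizing
    I_C(x) + 0.5*||x - v||^2 clips v onto the box, i.e. prox = P_C."""
    return np.clip(v, -1.0, 1.0)

v = np.array([2.5, -0.3, -4.0])
print(prox_indicator_box(v))  # [ 1.  -0.3 -1. ]
```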
An important particular case that often appears in practice is that where the function $f$ is a norm. For example, problems of the form (8) appear in many learning and signal processing problems, where the quadratic term can be seen as a data-fidelity term while the function $f$ can be thought of as imposing some prior on the solution (e.g., sparsity). The special case where $f$ is a norm also has a close connection to projections via the Moreau decomposition theorem, as we shall describe next. Let $f$ be a lower semicontinuous convex function; then its Fenchel conjugate is defined as
$$f^*(y) = \sup_{x} \; \langle y, x \rangle - f(x). \qquad (9)$$
The Moreau decomposition theorem relates the proximal operators of a convex function and its Fenchel conjugate, as stated next.
Theorem 1 ([14])
Let $f$ be a lower-semicontinuous convex function and let $f^*$ denote its Fenchel (or convex) conjugate. Then, for every $v$,
$$v = \operatorname{prox}_f(v) + \operatorname{prox}_{f^*}(v). \qquad (10)$$
For the special case where $f = \|\cdot\|$ is a norm, it is well known that its Fenchel conjugate is given by
$$f^*(y) = \mathbb{I}_{\mathcal{B}_*}(y), \quad \mathcal{B}_* = \{y : \|y\|_* \leq 1\}, \qquad (11)$$
where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$ (i.e., $\|y\|_* = \sup_{\|x\| \leq 1} \langle x, y \rangle$). That is, the Fenchel conjugate of a norm is the indicator function of the unit-norm ball of its dual norm (see for instance [2] for a proof). Since the proximal operator of the indicator function of a set equals the orthogonal projection onto the set, applying (10) to $f = \lambda\|\cdot\|$ yields
$$\operatorname{prox}_{\lambda\|\cdot\|}(v) = \big(\mathrm{Id} - P_{\lambda\mathcal{B}_*}\big)(v), \qquad (12)$$
where $P_{\lambda\mathcal{B}_*}$ denotes the projection onto the ball of radius $\lambda$ of the dual norm, and $\mathrm{Id}$ is the identity operator.
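The identity (12) is easy to verify numerically for the vector $\ell_1$ norm, whose dual is $\ell_\infty$: the prox of $\lambda\|\cdot\|_1$ is soft-thresholding and the projection onto the $\ell_\infty$ ball of radius $\lambda$ is a clip, and the two must sum to the input (a sanity check, not the paper's code):

```python
import numpy as np

lam = 0.7
v = np.array([1.5, -0.2, 0.7, -3.0])

prox_l1 = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)  # prox of lam*||.||_1
proj_linf = np.clip(v, -lam, lam)                        # projection, radius lam

# Moreau decomposition (10)/(12): input = prox + projection onto dual ball.
print(np.allclose(prox_l1 + proj_linf, v))  # True
```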
Let $A \in \mathbb{R}^{n \times m}$; then its $\ell_{1,\infty}$ mixed norm is given by
$$\|A\|_{1,\infty} = \max_{1 \leq i \leq m} \|a_i\|_1, \qquad (13)$$
where $a_i$ corresponds to the $i$-th column of matrix $A$. For the case of the induced operator norm we have the well-known relationship
$$\max_{1 \leq i \leq m} \|a_i\|_1 = \max_{\|x\|_1 \leq 1} \|Ax\|_1. \qquad (14)$$
Also, recall the duality relationship between the $\ell_{1,\infty}$ norm and the $\ell_{\infty,1}$ mixed norm:
$$\|A\|_{\infty,1} = \sum_{i=1}^{m} \|a_i\|_\infty = \max_{\|B\|_{1,\infty} \leq 1} \langle A, B \rangle. \qquad (15)$$
Thus, without loss of generality, we will focus our analysis on the derivation of the proximal operator of the $\ell_{1,\infty}$ norm, and derive expressions for the proximal operator of the induced norm and for the projection operator onto the $\ell_{\infty,1}$ ball using the above relationships.
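The duality (15) can be checked directly: placing a signed unit entry at the position of the largest-magnitude element of each column of $A$ gives a feasible $B$ with $\|B\|_{1,\infty} = 1$ that attains $\langle A, B \rangle = \|A\|_{\infty,1}$. A small numerical check under the conventions above:

```python
import numpy as np

A = np.array([[3.0, -1.0, 0.5],
              [-2.0, 4.0, 0.1]])

# Maximizer of <A, B> over the unit l_{1,inf} ball: one signed unit
# entry per column, at the row of largest magnitude.
B = np.zeros_like(A)
rows = np.abs(A).argmax(axis=0)
cols = np.arange(A.shape[1])
B[rows, cols] = np.sign(A[rows, cols])

assert np.abs(B).sum(axis=0).max() == 1.0   # ||B||_{1,inf} = 1
inner = (A * B).sum()
print(inner, np.abs(A).max(axis=0).sum())   # both equal ||A||_{inf,1} = 7.5
```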
3 Analysis of the $\ell_{1,\infty}$ mixed-norm proximal operator
The relationship given in (12) makes it clear that finding the proximal operator of a norm amounts to knowing how to project onto the unit-norm ball of the dual norm, and vice versa. In [8] the authors derived the optimality conditions for the projection onto the $\ell_{\infty,1}$ ball (see (15)) and proposed an algorithm for its computation based on sorting the entries of the matrix. Since these norms are duals of each other, the proximal operator for such norms can be readily computed based on (12). In contrast, we look at the proximal operator itself and derive its optimality conditions. By doing so, we arrive at a more compact expression for the optimality conditions that generalizes the well-known soft-thresholding operator to the matrix case. Our analysis allows for a more intuitive interpretation of the proximal operator, as well as the derivation of novel algorithms for its computation.
Given a matrix $A \in \mathbb{R}^{n \times m}$, the proximal operator of the $\ell_{1,\infty}$ mixed norm with parameter $\lambda > 0$ is the solution to the following convex optimization problem:
$$\operatorname{prox}_{\lambda\|\cdot\|_{1,\infty}}(A) = \arg\min_{X} \; \tfrac{1}{2}\|X - A\|_F^2 + \lambda \|X\|_{1,\infty}. \qquad (16)$$
Using the definition of the $\ell_{1,\infty}$ mixed norm in (13), we can rewrite problem (16) as the following constrained optimization problem:
$$\min_{X,\; t \geq 0} \; \tfrac{1}{2}\|X - A\|_F^2 + \lambda t \quad \text{subject to} \quad \|x_i\|_1 \leq t, \quad i = 1, \ldots, m. \qquad (17)$$
By looking at the structure of problem (17) it is easy to derive the following result:
Lemma 1 (Matched Sign)
The sign of the optimal solution $X^\star$ of (17) must match the sign of $A$, that is,
$$X^\star = \operatorname{sign}(A) \odot |X^\star|, \qquad (18)$$
where the $\operatorname{sign}$ function operates elementwise.
Proof:
The proof follows by contradiction. Assume $X^\star$ is the optimal solution to problem (17) and that some nonzero entries of $X^\star$ have the opposite sign to the corresponding entries in $A$, i.e., $\operatorname{sign}(x_{ji}^\star) = -\operatorname{sign}(a_{ji})$ for some $(j,i)$. Now, form the matrix $\bar{X}$ such that $\bar{x}_{ji} = |x_{ji}^\star| \operatorname{sign}(a_{ji})$. The point $\bar{X}$ is feasible and causes a reduction in the objective function, since $\|\bar{X} - A\|_F < \|X^\star - A\|_F$ while keeping the norm unchanged, $\|\bar{X}\|_{1,\infty} = \|X^\star\|_{1,\infty}$. This contradicts the assumption that $X^\star$ is the optimal solution. ∎
Based on Lemma 1, the problem of finding the proximal operator in (17) boils down to finding the magnitudes of the entries of the matrix $X$. Therefore, we can formulate it as the following problem (a power allocation problem belonging to the general family of waterfilling problems [15]):
$$\min_{V \geq 0,\; t \geq 0} \; \tfrac{1}{2}\|V - B\|_F^2 + \lambda t \quad \text{subject to} \quad \mathbf{1}^\top v_i \leq t, \quad i = 1, \ldots, m, \qquad (19)$$
where $B$ is a matrix with nonnegative entries given by $b_{ji} = |a_{ji}|$. The following result determines the optimal solution of problem (19) and, as a consequence, it also determines the proximal operator of the $\ell_{1,\infty}$ mixed norm:
Proposition 1
The optimal solution of problem (19) is given by
$$v_i^\star = \big(b_i - \mu_i \mathbf{1}\big)_+, \quad i = 1, \ldots, m, \qquad (20)$$
and
$$t^\star = \frac{\sum_{i \in \mathcal{I}} \frac{1}{|\mathcal{S}_i|} \sum_{j \in \mathcal{S}_i} b_{ji} - \lambda}{\sum_{i \in \mathcal{I}} \frac{1}{|\mathcal{S}_i|}}, \qquad (21)$$
where $\mu = [\mu_1, \ldots, \mu_m]^\top$, $\mathcal{I}$ is the set of columns affected by thresholding, $\mathcal{S}_i$ is the set of indices of the nonzero entries of $v_i^\star$, and
$$\mu_i = \begin{cases} \dfrac{\sum_{j \in \mathcal{S}_i} b_{ji} - t^\star}{|\mathcal{S}_i|} & i \in \mathcal{I}, \\ 0 & \text{otherwise}, \end{cases} \qquad (22)$$
is the $i$-th entry of the vector $\mu$.
Proof:
The Lagrangian of problem (19) is given by
$$L(V, t, \theta, \Lambda) = \tfrac{1}{2}\|V - B\|_F^2 + \lambda t + \sum_{i=1}^{m} \theta_i \big(\mathbf{1}^\top v_i - t\big) - \langle \Lambda, V \rangle. \qquad (23)$$
Since the problem is convex, the necessary and sufficient conditions for optimality are given by the KKT conditions:

Zero gradient of the Lagrangian:
$$V - B + \mathbf{1}\theta^\top - \Lambda = 0, \qquad (24)$$
$$\lambda - \sum_{i=1}^{m} \theta_i = 0. \qquad (25)$$

Primal and dual feasibility:
$$\mathbf{1}^\top v_i \leq t, \quad v_i \geq 0, \quad i = 1, \ldots, m, \qquad (26)$$
$$\theta_i \geq 0, \quad \Lambda \geq 0. \qquad (27)$$

Complementary slackness:
$$\theta_i \big(\mathbf{1}^\top v_i - t\big) = 0, \quad i = 1, \ldots, m, \qquad (28)$$
$$\Lambda \circ V = 0, \qquad (29)$$
where $\circ$ denotes the elementwise product.
We start by showing that equation (20) holds or, equivalently, that every column of $V^\star$ satisfies
$$v_i = \big(b_i - \theta_i \mathbf{1}\big)_+, \quad i = 1, \ldots, m. \qquad (30)$$
In order to do so, let $\mathcal{I}$ be the set of columns that are affected by thresholding. Take for instance $i \in \mathcal{I}$; then we have
$$\mathbf{1}^\top v_i = t. \qquad (31)$$
In this case we can have $v_{ji} > 0$ which, by (29), (26), (27), implies $\Lambda_{ji} = 0$ and hence $v_{ji} = b_{ji} - \theta_i$. Alternatively, we can have $v_{ji} = 0$, which means $b_{ji} \leq \theta_i$. Therefore, both situations can be written in compact form as
$$v_i = \big(b_i - \theta_i \mathbf{1}\big)_+, \qquad (32)$$
where the thresholding operation is applied elementwise. Alternatively, take $i \notin \mathcal{I}$; then from (28) it follows that $\theta_i = 0$ and hence, from (24), $v_i = b_i + \Lambda_i$. From (29) and the fact that $b_i \geq 0$ it follows that $\Lambda_{ji} = 0$ for all $j$. Therefore, we have that
$$v_i = b_i, \quad i \notin \mathcal{I}. \qquad (33)$$
Since $\theta_i = 0$ for $i \notin \mathcal{I}$, we can put together (32) and (33) into a single expression as in (30). It remains now to derive an expression that relates $\theta_i$ and $t$. We know from (28) and the fact that $\theta_i > 0$ for $i \in \mathcal{I}$ that
$$t = \mathbf{1}^\top v_i = \sum_{j \in \mathcal{S}_i} \big(b_{ji} - \theta_i\big), \quad i \in \mathcal{I}, \qquad (34)$$
where the set $\mathcal{S}_i$ denotes the nonzero entries of $v_i$. Solving for $\theta_i$ in (34) leads to
$$\theta_i = \frac{\sum_{j \in \mathcal{S}_i} b_{ji} - t}{|\mathcal{S}_i|}, \quad i \in \mathcal{I}. \qquad (35)$$
Recall that for $i \notin \mathcal{I}$ we have $\theta_i = 0$; therefore, we can compactly express $\mu_i = \theta_i$ as
$$\mu_i = \begin{cases} \dfrac{\sum_{j \in \mathcal{S}_i} b_{ji} - t}{|\mathcal{S}_i|} & i \in \mathcal{I}, \\ 0 & \text{otherwise}, \end{cases}$$
and we recover the expression in (22). Finally, using equation (25) it is easy to check that
$$\lambda = \sum_{i \in \mathcal{I}} \frac{\sum_{j \in \mathcal{S}_i} b_{ji} - t}{|\mathcal{S}_i|},$$
which, solving for $t$, yields (21) and completes the proof. ∎
We are now ready to derive an expression for the proximal operator of the $\ell_{1,\infty}$ mixed norm:
Corollary 1 (Proximal Operator)
The proximal operator of the $\ell_{1,\infty}$ mixed norm is given columnwise by
$$\big[\operatorname{prox}_{\lambda\|\cdot\|_{1,\infty}}(A)\big]_i = \operatorname{sign}(a_i) \odot \big(|a_i| - \mu_i \mathbf{1}\big)_+, \quad i = 1, \ldots, m, \qquad (36)$$
with $\mu_i$ as given in Proposition 1.

The expression in Corollary 1 closely resembles the well-known soft-thresholding operator. In fact, the proximal operator of the $\ell_{1,\infty}$ mixed norm applies a soft-thresholding operation to every column of the matrix, but with a possibly different threshold value for each column (see Fig. 1). As expected, the above expression reduces to soft-thresholding for $m = 1$:
Corollary 2 (Soft-thresholding)
In the case $m = 1$, so that $A = a$ is a vector, the proximal operator is given by the well-known soft-thresholding operator
$$\operatorname{prox}_{\lambda\|\cdot\|_1}(a) = \operatorname{sign}(a) \odot \big(|a| - \lambda \mathbf{1}\big)_+. \qquad (37)$$
Proof:
With a single column, (25) gives $\mu_1 = \theta_1 = \lambda$, and substituting into (36) yields (37). ∎
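The prox property in (37) can be certified without any solver: the residual $a - x^\star$ must be a subgradient of $\lambda\|\cdot\|_1$ at $x^\star$, i.e., every entry of the residual has magnitude at most $\lambda$, with equality (and matching sign) wherever $x^\star$ is nonzero. A quick numerical check of this optimality condition (illustrative code):

```python
import numpy as np

lam = 1.0
v = np.array([2.0, 0.4, -3.5, -0.9])
x = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)  # soft-thresholding (37)
r = v - x                                          # residual

assert np.all(np.abs(r) <= lam + 1e-12)            # r lies in lam*[-1, 1]^n
nz = x != 0
assert np.allclose(r[nz], lam * np.sign(x[nz]))    # tight where x is nonzero
```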
Corollary 3 (Projection onto the $\ell_{\infty,1}$ ball)
By the Moreau decomposition (12) and the duality (15), the projection onto the $\ell_{\infty,1}$ ball of radius $\tau$ is given by
$$P_{\{\|\cdot\|_{\infty,1} \leq \tau\}}(A) = A - \operatorname{prox}_{\tau\|\cdot\|_{1,\infty}}(A). \qquad (38)$$
Remark
Note that the results presented in this section can be trivially extended to the case of complex-valued matrices by interpreting the sign operation as extracting the phase of a complex number (i.e., $\operatorname{sign}(z) = z/|z|$ for $z \neq 0$).
4 Algorithms for computing the $\ell_{1,\infty}$ mixed-norm proximal operator
The results in Proposition 1 and Corollary 1 give us the basis for an efficient algorithm for computing the $\ell_{1,\infty}$ mixed-norm proximal operator. However, the computation of the proximal operator directly from those expressions requires knowledge of the optimal sets $\mathcal{I}$ and $\{\mathcal{S}_i\}$, which are not known a priori. In this section we present a procedure for addressing this issue. But first, we describe an efficient preprocessing stage that can be used to reduce the search space for the optimal sets $\mathcal{I}$ and $\{\mathcal{S}_i\}$ needed to compute the proximal operator in Proposition 1. The idea is to maximize a lower bound on the $\ell_{1,\infty}$ norm of the optimal solution, which allows us to discard columns that will not be affected by thresholding, hence reducing the search space for the optimal sets. This effectively reduces the dimensionality of the problem, since our algorithm is then applied to a smaller matrix (i.e., a matrix containing a subset of the columns of the original input matrix). After describing a procedure to maximize such a lower bound, we propose a general iterative algorithm that uses the results in Proposition 1 and Corollary 1 to find the optimal solution.
4.1 A lower bound on the $\ell_{1,\infty}$ norm
It follows from the analysis presented in Section 3 that only a subset of the columns of the matrix might be affected by the proximal operator (i.e., those with $\ell_1$ norm larger than $t^\star$). This fact can be exploited to reduce the search space of the problem, provided that some knowledge about the value of $t^\star$ is available. In particular, having a lower bound on $t^\star$ would allow us to discard columns with smaller $\ell_1$ norm. It turns out that a simple lower bound can be derived from the optimality conditions, as stated in the following result:
Lemma 2 (Lower bound on the $\ell_{1,\infty}$ norm)
Let $A \in \mathbb{R}^{n \times m}$ and $\lambda > 0$. Let $t^\star$ be the $\ell_{1,\infty}$ mixed norm of the optimal solution of (16). Then, for any subset $\mathcal{C} \subseteq \{1, \ldots, m\}$,
$$t^\star \geq \frac{\sum_{i \in \mathcal{C}} \|a_i\|_1 - n\lambda}{|\mathcal{C}|}. \qquad (39)$$
Proof:
From the optimality conditions of Problem (19) we know that $\theta_i = \big(\sum_{j \in \mathcal{S}_i} b_{ji} - t^\star\big)/|\mathcal{S}_i|$ for $i \in \mathcal{I}$, with $\sum_i \theta_i = \lambda$, and that $|\mathcal{S}_i| \leq n$. Then, for any subset $\mathcal{C}$ it follows that
$$n\lambda \geq n \sum_{i \in \mathcal{C}} \theta_i \geq \sum_{i \in \mathcal{C}} \big(\|a_i\|_1 - t^\star\big) = \sum_{i \in \mathcal{C}} \|a_i\|_1 - |\mathcal{C}|\, t^\star, \qquad (40)$$
where the last inequality follows from the fact that $n\theta_i \geq \|a_i\|_1 - t^\star$ for every column (entries outside $\mathcal{S}_i$ satisfy $b_{ji} \leq \theta_i$, and $\|a_i\|_1 \leq t^\star$ whenever $\theta_i = 0$). Rearranging yields (39). ∎
In order to reduce the search space of the problem, we can maximize the lower bound in (39) with respect to $\mathcal{C}$. Since the sum is maximized by choosing the columns of $A$ with the largest $\ell_1$ norms, a simple method to compute the set that maximizes the bound is to sort the columns of $A$ according to their $\ell_1$ norms, evaluate the bound for the top $k$ columns, and choose the value of $k$ that maximizes it, as described in Algorithm 1. Specifically, we form the vector $s$ that contains the $\ell_1$ norms of the columns of $A$ in decreasing order. From $s$ we compute the partial sums $S_k = \sum_{l=1}^{k} s_l$ for $k = 1, \ldots, m$, and evaluate the value of the bound (39) as described in Algorithm 1 to find its maximizer.
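In code, this preprocessing stage amounts to sorting the column $\ell_1$ norms, forming partial sums, and keeping the best value of the bound over the number of retained columns. A minimal sketch of the procedure (our own illustration of Algorithm 1, using the bound (39) in the form stated in Lemma 2; `lam` denotes the proximal parameter $\lambda$):

```python
import numpy as np

def lower_bound_39(A, lam):
    """Best lower bound on t* from (39): maximize over the top-k columns."""
    n = A.shape[0]
    s = np.sort(np.abs(A).sum(axis=0))[::-1]   # column l1 norms, descending
    partial = np.cumsum(s)                     # S_k = sum of top-k norms
    k = np.arange(1, s.size + 1)
    return ((partial - n * lam) / k).max()

A = np.array([[3.0, 1.0],
              [0.0, 0.0]])
print(lower_bound_39(A, lam=0.5))  # max((3-1)/1, (4-1)/2) = 2.0
```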
Note, however, that while Algorithm 1 allows us to efficiently find a maximum lower bound for $t^\star$, depending on the parameter $\lambda$ the maximum lower bound might be smaller than zero, in which case it is not useful. In such a case, an alternative lower bound for $t^\star$ is given by the following result:
Lemma 3
Let $A \in \mathbb{R}^{n \times m}$ and $\lambda > 0$. Let $t^\star$ be the $\ell_{1,\infty}$ mixed norm of the optimal solution. Then, it holds that
$$t^\star \geq \frac{\|A\|_{\infty,1} - \lambda}{m} = \frac{\sum_{i=1}^{m} \|a_i\|_\infty - \lambda}{m}. \qquad (41)$$
Proof:
For every column we have $t^\star \geq \|x_i^\star\|_1 \geq \|x_i^\star\|_\infty \geq \|a_i\|_\infty - \theta_i$. Summing over the columns and using $\sum_i \theta_i = \lambda$ gives
$$m\, t^\star \geq \sum_{i=1}^{m} \big(\|a_i\|_\infty - \theta_i\big) = \|A\|_{\infty,1} - \lambda, \qquad (42)$$
which is (41). ∎

Note that the bound in (41) will be negative only if the optimal solution is the zero matrix, since
$$\operatorname{prox}_{\lambda\|\cdot\|_{1,\infty}}(A) = 0 \iff \|A\|_{\infty,1} \leq \lambda. \qquad (43)$$
4.2 A general algorithm
A general procedure for computing the proximal operator of the $\ell_{1,\infty}$ mixed norm can be devised based on the optimality conditions of Proposition 1 and the observation that, for a fixed $t$, the problem in (19) boils down to projecting the columns of $B$ onto the $\ell_1$ ball of radius $t$. A possible strategy for finding $t^\star$ is to start with a lower bound for $t^\star$, project each column of $B$ whose $\ell_1$ norm is above the current lower bound onto the $\ell_1$ ball of radius $t$, update the value of the lower bound using (21), and keep iterating until there are no further changes in $t$ (see Algorithm 2). This algorithm is guaranteed to converge to the optimal solution, as stated next.
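The iteration just described can be sketched in a few dozen lines of NumPy. The following is our own illustrative implementation (not the authors' released code): it uses a standard sort-based routine for the inner $\ell_1$-ball projection, starts from a simple lower bound on $t^\star$ in the spirit of Lemma 3, and applies the fixed-point update (21) until $t$ stabilizes:

```python
import numpy as np

def _l1_threshold(b, t):
    """For b >= 0 with sum(b) > t > 0, return (theta, |S|, sum_{S} b) such
    that sum(max(b - theta, 0)) = t. Standard sort-based l1-ball projection."""
    u = np.sort(b)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, u.size + 1)
    theta_k = (css - t) / k
    rho = np.where(u > theta_k)[0].max() + 1   # largest valid support size
    return (css[rho - 1] - t) / rho, rho, css[rho - 1]

def prox_l1inf(A, lam, max_iter=100):
    """Prox of lam * ||.||_{1,inf} via the fixed-point iteration on t
    (a sketch of Algorithm 2; not an optimized implementation)."""
    B = np.abs(A).astype(float)
    if B.max(axis=0).sum() <= lam:        # ||A||_{inf,1} <= lam => prox is 0
        return np.zeros_like(A, dtype=float)
    col_l1 = B.sum(axis=0)
    t = (B.max(axis=0).sum() - lam) / B.shape[1]   # simple lower bound on t*
    for _ in range(max_iter):
        num = den = 0.0
        for i in np.where(col_l1 > t)[0]:  # columns affected by thresholding
            _, n_i, c_i = _l1_threshold(B[:, i], t)
            num += c_i / n_i
            den += 1.0 / n_i
        t_new = (num - lam) / den          # fixed-point update, cf. (21)
        if np.isclose(t_new, t):
            break
        t = t_new
    # assemble the solution: per-column soft-thresholding, cf. Proposition 1
    X = np.array(A, dtype=float, copy=True)
    for i in np.where(col_l1 > t)[0]:
        theta, _, _ = _l1_threshold(B[:, i], t)
        X[:, i] = np.sign(A[:, i]) * np.maximum(B[:, i] - theta, 0.0)
    return X

A = np.array([[2.0, 3.0],
              [2.0, 0.0]])
X = prox_l1inf(A, lam=2.0)
print(X)  # [[1. 2.] [1. 0.]], with t* = max column l1 norm = 2
```

The output can be certified exactly: the residual $A - X$ must satisfy $\|A - X\|_{\infty,1} \leq \lambda$ together with $\langle A - X,\, X\rangle = \lambda \|X\|_{1,\infty}$, which is precisely the condition $A - X \in \lambda\, \partial\|X\|_{1,\infty}$.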
Proposition 2 (Convergence)
Algorithm 2 converges to the proximal operator of the $\ell_{1,\infty}$ mixed norm of matrix $A$ in a finite number of iterations.
Proof:
Observe that the algorithm produces a monotonic sequence of values $t_k$ that eventually converges to the optimal value $t^\star$. To see this, note that for a given $t_k$, the projection onto the $\ell_1$ ball has the form of (30), that is, $v_i = (b_i - \theta_i(t_k)\mathbf{1})_+$ for some value $\theta_i(t_k) \geq 0$. Let $\theta_i^\star$ denote the value at the optimal solution. Since $t_k$ is a lower bound on $t^\star$, it is necessarily the case that $\theta_i(t_k) \geq \theta_i^\star$ (hence $\sum_i \theta_i(t_k) \geq \lambda$). Let $\mathcal{I}_k$ and $\{\mathcal{S}_i\}$ denote the resulting sets after projecting onto the $\ell_1$ ball of radius $t_k$. The new value is then given by
$$t_{k+1} = \frac{\sum_{i \in \mathcal{I}_k} \frac{1}{|\mathcal{S}_i|}\sum_{j \in \mathcal{S}_i} b_{ji} - \lambda}{\sum_{i \in \mathcal{I}_k} \frac{1}{|\mathcal{S}_i|}} \;\geq\; \frac{\sum_{i \in \mathcal{I}_k} \frac{1}{|\mathcal{S}_i|}\sum_{j \in \mathcal{S}_i} b_{ji} - \sum_{i \in \mathcal{I}_k}\theta_i(t_k)}{\sum_{i \in \mathcal{I}_k} \frac{1}{|\mathcal{S}_i|}} \;=\; t_k, \qquad (44)$$
where the inequality follows from the fact that $\lambda \leq \sum_{i \in \mathcal{I}_k} \theta_i(t_k)$ (see (25)). In fact, the inequality is strict and is satisfied with equality only when $t_k = t^\star$, since in that case the optimality conditions of Proposition 1 are satisfied. Note that a change in $t$ can only happen if there is a change in the sets $\mathcal{I}$ or $\{\mathcal{S}_i\}$. Since the number of possible sets is finite and the values $t_k$ are monotonic, the algorithm terminates in a finite number of iterations. ∎
5 Numerical experiments
5.1 Complexity and implementation
The complexity of Algorithm 2 depends on the method used for computing the inner projection step (a projection onto the simplex), for which different alternatives exist in the literature [16]. A naive implementation of the proposed algorithm can lead to a computationally inefficient method if, at every iteration, the projection step is computed from scratch. Alternatively, one can reuse estimates from one iteration to the next to improve computational efficiency. In this paper we propose two different approaches for computing the projection step onto the simplex. The first is based on sorting the columns of the matrix of absolute values that are affected by thresholding; in this case, the expected complexity of the method is dominated by the sorting operation, requiring $O(nm\log n)$ operations. The second is an active-set method based on Proposition 1; in fact, its simplex-projection step is equivalent to the one in [17]. While in the latter case the complexity analysis is not straightforward, we have experimentally observed it to be more efficient than the sorting-based approach in most of the tested scenarios.
5.2 Numerical validation
In order to evaluate the computational complexity of the proposed algorithms, we randomly generate matrices with independent and identically distributed entries drawn from a uniform distribution. We then apply the proposed implementations of Algorithm 2, labeled “Sort” for the sorting-based implementation and “Active Set” for the one based on active sets, to compute projections onto the $\ell_{\infty,1}$ mixed-norm ball for different radius values. We also compute the projections using state-of-the-art algorithms. In particular, we compare to the method proposed in [8], which we denote as “QT,” and to the recently proposed root-search-based methods of [13, 18], which we denote as “ST” (Steffensen) and “NT” (Newton), respectively. We record the execution time for different configurations (sizes) of the data matrix and for different values of the ball radius. In our experiments, we choose the radius of the ball to be a fraction of the true $\ell_{\infty,1}$ mixed norm of the matrix and compute the average computation time over independent realizations. For the methods in [8, 13, 18] we use the implementations provided by the authors. The results for different matrix sizes are displayed in Table 1. As can be observed from the table, our two implementations achieve the best performance, offering an improvement over the state of the art that ranges between one and two orders of magnitude.

Table 1: Average execution times (in seconds) for different matrix sizes and ball radii.

Size  ST  NT  QT  Sort  Active Set
  3.51E-03  3.46E-03  1.69E-03  1.03E-04  9.61E-05
  1.33E-02  9.17E-03  1.70E-03  1.44E-04  1.35E-04
  5.11E-02  3.04E-02  1.71E-03  3.74E-04  3.64E-04
  5.49E-02  3.65E-02  1.73E-03  7.02E-04  6.94E-04
  9.23E-03  8.42E-03  1.97E-02  1.06E-03  1.04E-03
  3.81E-02  2.69E-02  1.96E-02  2.49E-03  2.47E-03
  1.19E-01  7.80E-02  1.96E-02  7.98E-03  8.01E-03
  1.16E-01  8.21E-02  1.96E-02  9.25E-03  9.28E-03
  8.60E-02  6.25E-02  1.94E-02  8.88E-04  8.63E-04
  4.46E-01  2.60E-01  1.94E-02  1.28E-03  1.26E-03
  4.88E-01  3.19E-01  1.96E-02  3.56E-03  3.55E-03
  6.01E-01  3.96E-01  1.98E-02  6.82E-03  6.82E-03
  4.48E-01  3.28E-01  2.41E-01  1.11E-02  1.09E-02
  1.41E+00  9.23E-01  2.40E-01  2.61E-02  2.62E-02
  1.60E+00  1.14E+00  2.40E-01  8.24E-02  8.22E-02
  1.61E+00  1.17E+00  2.40E-01  9.56E-02  9.56E-02
  1.55E+01  1.01E+01  3.41E+00  1.18E-01  1.18E-01
  1.53E+01  1.06E+01  3.30E+00  2.62E-01  2.63E-01
  1.83E+01  1.31E+01  3.33E+00  8.31E-01  8.31E-01
  1.97E+01  1.42E+01  3.33E+00  9.64E-01  9.64E-01
5.3 Application to cancer classification from gene expression data
In this section we test our algorithms in the context of multi-task learning for the problem of cancer classification from gene expression data, where the dimensionality $d$ of the feature vectors is typically much larger than the number of samples $N$. We use the datasets provided in [19], which consist of five curated datasets of different types of cancers, as described in [19]. The datasets are briefly summarized in Table 2. We pose the classification problem as a multi-task learning problem. In particular, given a dataset of points $\{x_i\}_{i=1}^{N}$, $x_i \in \mathbb{R}^d$, with associated labels $y_i \in \{1, \ldots, K\}$, where $K$ is the number of classes, we build a data matrix $X = [x_1, \ldots, x_N]^\top$ and a target label matrix $Y \in \mathbb{R}^{N \times K}$ with
$$Y_{ij} = \begin{cases} 1 & \text{if } y_i = j, \\ 0 & \text{otherwise}. \end{cases} \qquad (45)$$
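A small sketch of the label-matrix construction in (45), assuming a 0/1 indicator coding with labels in $1, \ldots, K$ (the exact target coding used by the authors may differ):

```python
import numpy as np

def one_hot_labels(y, num_classes):
    """Y[i, j] = 1 if sample i has label j (labels assumed in 1..K), else 0."""
    Y = np.zeros((len(y), num_classes))
    Y[np.arange(len(y)), np.asarray(y) - 1] = 1.0
    return Y

Y = one_hot_labels([1, 3, 2, 3], num_classes=3)
print(Y)  # one row per sample, a single 1 in the column of its class
```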
Dataset  Classes  Samples  Dimension 

Carcinom [20, 21]  11  174  9182 
GLIOMA [22]  4  50  4434 
LUNG [23]  5  203  3312 
ALLAML [24]  2  72  7129 
ProstateGE [25]  2  102  5966 
The problem is to predict the correct label for each class while enforcing feature sharing among the classes:
$$\min_{W \in \mathbb{R}^{K \times d}} \; \tfrac{1}{2}\big\|Y - XW^\top\big\|_F^2 \quad \text{subject to} \quad \|W\|_{\infty,1} \leq \tau, \qquad (46)$$
where the columns of $W$ correspond to features, so that the $\ell_{\infty,1}$ constraint encourages entire features to be discarded jointly across classes. Note that problem (46) falls within the family of problems in (2), which can be solved using a projected gradient descent strategy. For the projection step onto the $\ell_{\infty,1}$ ball we use the sorting-based implementation of Algorithm 2.
We conducted an experiment using the datasets of Table 2, where we center the data points (mean subtraction) and normalize them by dividing each coordinate by its standard deviation. For each dataset we split the data into training and testing sets and computed the average classification performance over random data splits. Once we solve (46), we use the following simple classification rule:
$$\hat{y}(x) = \arg\max_{1 \leq j \leq K} \; (Wx)_j. \qquad (47)$$
In addition, we use the learned weights to identify relevant features and train a (kernel) support vector machine (SVM) classifier on the identified features. Features are sorted according to the Euclidean norms of the columns of $W$, the most relevant feature being the one with the largest norm. For the multiclass problems we use a one-versus-one strategy with majority voting. We also provide a comparison with the $\ell_{2,1}$-norm-based feature selection method of [19], for which we used the implementation provided by the authors. The $\ell_{\infty,1}$ ball radius in (46), as well as the regularization parameter for the method in [19], were chosen using a grid search. The average classification accuracies of both methods under the classification rule (47) are summarized in Table 3. As can be appreciated, the proposed method using the $\ell_{\infty,1}$ norm achieves better classification accuracy than the method based on the $\ell_{2,1}$ norm proposed in [19]. It is important to note that the differences are more pronounced in multiclass problems than in binary ones, indicating, as expected, that the $\ell_{\infty,1}$ norm encourages the discovery of variables that are most correlated.

Table 3: Average classification accuracy (%) using the classification rule (47).

Dataset  Carcinom  GLIOMA  LUNG  ALLAML  ProstateGE
  [11 classes]  [4 classes]  [5 classes]  [2 classes]  [2 classes]
$\ell_{2,1}$ (Nie et al.)  95.50  68.90  76.95  92.36  93.25
$\ell_{\infty,1}$ (Proposed)  97.74  78.50  83.28  95.07  93.65
We also report the classification results using an SVM classifier for different numbers of selected features. The results are displayed in Fig. 2 for all datasets. We can observe the superior performance of the proposed scheme in selecting relevant features for the discrimination task. Again, the performance gap is generally more pronounced on the datasets with more than two classes.
6 Conclusions
In this paper we have analyzed in detail the proximity operator of the $\ell_{1,\infty}$ mixed matrix norm. We have provided simple expressions for its computation that generalize the well-known soft-thresholding operator. By exploiting the duality relationship with the $\ell_{\infty,1}$ norm, we have also derived the projection operator onto the $\ell_{\infty,1}$ mixed-norm ball. In addition, we have proposed a general algorithm for the computation of the proximal operator, together with two particular implementations that can be orders of magnitude faster than the state of the art, making them particularly suitable for large-scale problems. We have also illustrated the application of the $\ell_{\infty,1}$ norm to biomarker discovery (feature selection) for the problem of cancer classification from gene expression data.
Acknowledgment
This research was supported in part by the Northrop Grumman Mission Systems Research in Applications for Learning Machines (REALM) initiative.
References
 [1] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed. New York, NY, USA: Springer, 2006.
 [2] S. Boyd and L. Vandenberghe, Convex Optimization. New York, NY, USA: Cambridge University Press, 2004.
 [3] P. L. Combettes and J.C. Pesquet, “Proximal Splitting Methods in Signal Processing,” in FixedPoint Algorithms for Inverse Problems in Science and Engineering, H. Bauschke, R. Burachik, P. Combettes, V. Elser, D. Luke, and H. Wolkowicz, Eds. Springer, 2011, pp. 185–212. [Online]. Available: https://hal.inria.fr/hal00643807
 [4] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, “Optimization with sparsityinducing penalties,” Foundations and Trends® in Machine Learning, vol. 4, no. 1, pp. 1–106, 2012. [Online]. Available: http://dx.doi.org/10.1561/2200000015
 [5] N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends® in Optimization, vol. 1, no. 3, pp. 127–239, 2014. [Online]. Available: http://dx.doi.org/10.1561/2400000003
 [6] G. Obozinski, B. Taskar, and M. I. Jordan, “Joint covariate selection and joint subspace selection for multiple classification problems,” Statistics and Computing, vol. 20, no. 2, pp. 231–252, 2010. [Online]. Available: http://dx.doi.org/10.1007/s112220089111x
 [7] B. A. Turlach, W. N. Venables, and S. J. Wright, “Simultaneous variable selection,” Technometrics, vol. 47, no. 3, pp. 349–363, Aug. 2005.
 [8] A. Quattoni, X. Carreras, M. Collins, and T. Darrell, “An Efficient Projection for $\ell_{1,\infty}$ Regularization,” in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML ’09. New York, NY, USA: ACM, 2009, pp. 857–864. [Online]. Available: http://doi.acm.org/10.1145/1553374.1553484

 [9] E. Elhamifar, G. Sapiro, and R. Vidal, “See all by looking at a few: Sparse modeling for finding representative objects,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012.
 [10] I. Dokmanić and R. Gribonval, “Beyond Moore-Penrose part I: Generalized inverses that minimize matrix norms,” CoRR, vol. abs/1706.08349, 2017. [Online]. Available: http://arxiv.org/abs/1706.08349
 [11] I. Dokmanić and R. Gribonval, “Concentration of the frobenius norm of generalized matrix inverses,” SIAM Journal on Matrix Analysis and Applications, vol. 40, no. 1, pp. 92–121, 2019.
 [12] S. Sra, “Fast projections onto $\ell_{1,q}$-norm balls for grouped feature selection,” in Machine Learning and Knowledge Discovery in Databases. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 305–317.
 [13] G. Chau, B. Wohlberg, and P. Rodríguez, “Fast projection onto the $\ell_{\infty,1}$ mixed norm ball using Steffensen root search,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, Alberta, Canada, Apr. 2018, pp. 4694–4698.
 [14] J. J. Moreau, “Proximité et dualité dans un espace Hilbertien,” Bulletin de la Société Mathématique de France, vol. 93, pp. 273–299, 1965.
 [15] D. P. Palomar and J. R. Fonollosa, “Practical algorithms for a family of waterfilling solutions,” IEEE Transactions on Signal Processing, vol. 53, no. 2, pp. 686–695, Feb 2005.
 [16] L. Condat, “Fast projection onto the simplex and the $\ell_1$ ball,” Mathematical Programming, vol. 158, no. 1, pp. 575–585, 2016. [Online]. Available: http://dx.doi.org/10.1007/s1010701509466
 [17] C. Michelot, “A Finite Algorithm for Finding the Projection of a Point onto the Canonical Simplex of $\mathbb{R}^n$,” Journal of Optimization Theory and Applications, vol. 50, no. 1, pp. 195–200, 1986.
 [18] G. Chau, B. Wohlberg, and P. Rodríguez, “Efficient projection onto the $\ell_{\infty,1}$ mixed-norm ball using a Newton root search method,” SIAM Journal on Imaging Sciences, 2019, accepted for publication.
 [19] F. Nie, H. Huang, X. Cai, and C. H. Ding, “Efficient and Robust Feature Selection via Joint $\ell_{2,1}$-Norms Minimization,” in Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’10. USA: Curran Associates, Inc., 2010, pp. 1813–1821.
 [20] A. I. Su, J. B. Welsh, L. M. Sapinoso, S. G. Kern, P. Dimitrov, H. Lapp, P. G. Schultz, S. M. Powell, C. A. Moskaluk, H. F. Frierson, and G. M. Hampton, “Molecular classification of human carcinomas by use of gene expression signatures,” Cancer Research, vol. 61, no. 20, pp. 7388–7393, 2001. [Online]. Available: http://cancerres.aacrjournals.org/content/61/20/7388
 [21] K. Yang, Z. Cai, J. Li, and G. Lin, “A stable gene selection in microarray data analysis.” BMC Bioinformatics, vol. 7, pp. 228 – 243, 2006. [Online]. Available: http://search.ebscohost.com.proxy1.library.jhu.edu/login.aspx?direct=true&db=asn&AN=43698712&site=ehostlive&scope=site
 [22] C. L. Nutt, D. R. Mani, R. A. Betensky, P. Tamayo, J. G. Cairncross, C. Ladd, U. Pohl, C. Hartmann, M. E. McLaughlin, T. T. Batchelor, P. M. Black, A. von Deimling, S. L. Pomeroy, T. R. Golub, and D. N. Louis, “Gene expressionbased classification of malignant gliomas correlates better with survival than histological classification,” Cancer Research, vol. 63, no. 7, pp. 1602–1607, 2003. [Online]. Available: http://cancerres.aacrjournals.org/content/63/7/1602
 [23] A. Bhattacharjee, W. G. Richards, J. Staunton, C. Li, S. Monti, P. Vasa, C. Ladd, J. Beheshti, R. Bueno, M. Gillette, M. Loda, G. Weber, E. J. Mark, E. S. Lander, W. Wong, B. E. Johnson, T. R. Golub, D. J. Sugarbaker, and M. Meyerson, “Classification of human lung carcinomas by mrna expression profiling reveals distinct adenocarcinoma subclasses,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 24, pp. 13 790–13 795, 2001. [Online]. Available: http://www.jstor.org/stable/3057173
 [24] S. P. Fodor, “Massively parallel genomics.” Science, vol. 277, no. 5324, pp. 393 – 395, 1997.
 [25] D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D’Amico, J. P. Richie, E. S. Lander, M. Loda, P. W. Kantoff, T. R. Golub, and W. R. Sellers, “Gene expression correlates of clinical prostate cancer behavior,” Cancer Cell, vol. 1, no. 2, pp. 203 – 209, 2002. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1535610802000302