# The fastest ℓ_1,∞ prox in the west

Proximal operators are of particular interest in optimization problems dealing with non-smooth objectives because in many practical cases they lead to optimization algorithms whose updates can be computed in closed form or very efficiently. A well-known example is the proximal operator of the vector ℓ_1 norm, which is given by the soft-thresholding operator. In this paper we study the proximal operator of the mixed ℓ_1,∞ matrix norm and show that it can be computed in closed form by applying the well-known soft-thresholding operator to each column of the matrix. However, unlike the vector ℓ_1 norm case where the threshold is constant, in the mixed ℓ_1,∞ norm case each column of the matrix might require a different threshold and all thresholds depend on the given matrix. We propose a general iterative algorithm for computing these thresholds, as well as two efficient implementations that further exploit easy to compute lower bounds for the mixed norm of the optimal solution. Experiments on large-scale synthetic and real data indicate that the proposed methods can be orders of magnitude faster than state-of-the-art methods.

There are no comments yet.

## Authors

• 2 publications
• 29 publications
• 64 publications
07/02/2020

### Efficient Proximal Mapping of the 1-path-norm of Shallow Networks

We demonstrate two new important properties of the 1-path-norm of shallo...
02/12/2021

### Proximal and Federated Random Reshuffling

Random Reshuffling (RR), also known as Stochastic Gradient Descent (SGD)...
11/09/2020

### Reduced-Rank Regression with Operator Norm Error

A common data analysis task is the reduced-rank regression problem: ...
03/16/2013

### l_2,p Matrix Norm and Its Application in Feature Selection

Recently, l_2,1 matrix norm has been widely applied to many areas such a...
12/19/2019

### Parseval Proximal Neural Networks

The aim of this paper is twofold. First, we show that a certain concaten...
02/12/2021

### From perspective maps to epigraphical projections

The projection onto the epigraph or a level set of a closed proper conve...
03/09/2018

### Local Kernels that Approximate Bayesian Regularization and Proximal Operators

In this work, we broadly connect kernel-based filtering (e.g. approaches...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Recent advances in machine learning and convex optimization techniques have led to very efficient algorithms for solving a family of regularized estimation problems. Sparsity, as a strong regularization prior, plays a central role in many inverse problems and the use of sparsity-promoting norms as regularizers has become widespread over many different disciplines of science and engineering. One added difficulty is the non-differentiability of such priors, which prevents the use of classical optimization methods such as gradient descent or Gauss-Newton methods

[1, 2]. Proximal algorithms present an efficient alternative to cope with non-smoothness of the objective function. Furthermore, in many practical situations, simple closed-form updates of the variables of interest are possible. For an excellent review about proximal operators and algorithms see [3] and the monographs [4, 5].

### 1.1 Motivation

Let be a real matrix with columns . The mixed norm of is defined over its columns as

 ∥X∥p,q=(m∑i=1∥xi∥qp)1/q. (1)

Mixed norms such as the matrix norm have been used to promote block-sparse structure in the variables of interest, and the larger the stronger the correlation among the rows of [6]. In particular, the norm has been shown to be useful for estimating a set of covariate regressors in problems such as multi-task learning [7, 8, 6], and representative (exemplar) selection [9]

. A general formulation for these type of problems is to minimize some convex loss function subject to norm constraints:

 minimizeXJ(X)subject to∥X∥∞,1≤τ, (2)

where controls the sparsity level and is some convex loss function. Note that keeping small encourages whole columns of to be zero. In this contribution, we are interested in efficiently solving problems of the form of (2). A simple method to solve problem (2) is to use a projected (sub)gradient descent method that computes the th iteration estimate as

 Z ← X(k−1)−ηk∂J(X(k−1)) (3) X(k) ← P∥⋅∥∞,1≤τ(Z), (4)

where is the stepsize at the th iteration of the algorithm, denotes a subgradient (i.e., the gradient if differentiable) of at , and where denotes the projection of onto the ball of radius . The main computational challenge of such method is to project onto the mixed norm which can be computed by a proximal mapping of the dual norm—the mixed norm (i.e., the induced norm of seen as a linear operator). In this paper, we address the problem of projecting onto the mixed norm via the computation of the proximal operator of its dual norm. This allows us to solve the class of problems in (2) that involve structured/group sparsity, namely those involving constraints on (projections onto) the mixed norm. The proximal mapping of the mixed norm is also applicable to the computation of minimax sparse pseudoinverses to underdetermined systems of linear equations [10, 11].

### 1.2 Prior work

Since the main computational challenge in solving problems of the form (2) is in the computation of the projection onto the ball of a certain radius, it is then of practical importance to devise computationally efficient algorithms for computing such projections. An efficient method for computing these projections is proposed in [8]. The algorithm is based on sorting the entries of the by data matrix in order to find the values that satisfy the optimality conditions of the projection problem. The complexity of the method is dominated by the sorting operation and therefore has an average complexity of . An alternative strategy is to use root-search methods such as those in [12, 13] in order to find the optimal solution. Here we take an alternative approach and look at the proximal operator of the mixed matrix norm. Since the mixed and norms are duals of each other, a simple relationship can be established between the proximal operator and the projection operator (see Section 2). However, by looking at the proximal operator a better insight and understanding of the problem can be gained and exploited to accelerate the algorithms. Contrary to root-search methods our method is exact (up to machine precision), does not require any thresholds to determine convergence, and it is guaranteed to find the optimal solution in a finite number of iterations.

### 1.3 Contributions

In this paper we study the proximal operator of the mixed matrix norm and show that it can be computed using a generalization of the well-known soft-thresholding operator from the vector to the matrix case. The generalization involves applying the soft-thresholding operator to each column of the matrix using a possibly different threshold for each column. Interestingly, all thresholds are related to each other via a quantity that depends on the given matrix. This is in sharp contrast to the vector case, where the threshold is constant and is given by the regularization parameter. To compute the proximal operator efficiently, we propose a general iterative algorithm based on the optimality conditions of the proximal problem. Our method is further accelerated by the derivation of easy to compute lower bounds on the optimal value of the proximal problem that contribute to effectively reduce the search space. A numerical comparison with the state of the art of two particular implementations of our general method reveals the improved computational efficiency of the proposed algorithms. We also illustrate the application of our results to biomarker discovery for the problem of cancer classification from gene expression data. The code used to generate the results presented in this paper is made publicly available by the authors.

## 2 Norms, projections, and proximal operators

In this section we present some background material that highlights the relationship between proximal operators, norms, and orthogonal projection operators.

Consider a non-empty closed convex set . The orthogonal projection of a point onto is given by

 PC(x)=argminy∈C12∥x−y∥22, (5)

where we have included an irrelevant factor for convenience in the exposition. Alternatively, we can also express the projection of a point as an unconstrained optimization problem as

 PC(x)=argminyI(y∈C)+12∥x−y∥22, (6)

where we have moved the constraint into the objective by making use of the indicator function of a non-empty subset , which is given by

 I(x∈X)={0,x∈X+∞,otherwise. (7)

Keeping in mind the definition of the projection operator given in (6) as an unconstrained optimization problem, we are now ready to introduce the definition of the proximal operator. Let be a lower semicontinuous convex function. Then, for every the proximal operator is defined as

 proxf(x)=argminyf(y)+12∥x−y∥22. (8)

It is then clear, that the proximal operator can be regarded as a generalization of the projection operator (e.g., replace by the indicator function of a set ). Note that, at every point, the proximal operator is the unique solution of an unconstrained convex optimization problem. Uniqueness of the proximal operator can be easily argued from the fact that the quadratic term in (8) makes the optimization cost strictly convex.

An important particular case that often appears in practice is that where the function is a norm. For example, problems of the form of (8) appear in many learning and signal processing problems, where the quadratic term can be seen as a data-fidelity term while the function can be thought of as imposing some prior on the solution (e.g., sparsity). The special case where is a norm has also a close connection to projections via the Moreau decomposition theorem as we shall describe next. Let be a lower semicontinuous convex function, then its Fenchel conjugate is defined as

 f∗(y)=supx∈X{⟨y,x⟩−f(x)}. (9)

The Moreau decomposition theorem relates the proximal operators of a convex function and its Fenchel conjugate, as stated next.

###### Theorem 1 ([14])

Let be a lower-semicontinuous convex function and let denote its Fenchel (or convex) conjugate, then

 proxf(x)+proxf∗(x)=x. (10)

For the special case where is a norm, it is well known that its Fenchel conjugate is given by

 f∗(x)=I(∥x∥∗≤1)={[]rl0,∥x∥∗≤1+∞,otherwise, (11)

where is the dual norm of (i.e., ). That is, the Fenchel conjugate of a norm is the indicator function of the unit-norm ball of its dual norm (see for instance [2] for a proof). Since the proximal operator of the indicator function of a set equals the orthogonal projection onto the set, it follows from (10) that

 proxλ∥⋅∥=I−P∥⋅∥∗≤λ, (12)

where denotes the projection onto the ball of radius of the dual norm, and where is the identity operator.

Let then its mixed (induced ) norm is given by

 ∥X∥1,∞=max∥u∥1=1∥Xu∥1=maxi∥xi∥1, (13)

where corresponds to the th column of matrix . For the case of the induced operator norm we have the well-known relationship

 ∥X∥∞=max∥u∥∞=1∥Xu∥∞=∥XT∥1,∞. (14)

Also, recall the duality relationship between the norm and the mixed norm:

 ∥X∥∞,1=m∑i=1∥xi∥∞=(∥X∥1,∞)∗. (15)

Thus, without loss of generality, we will focus our analysis on the derivation of the proximal operator for the mixed norm, and derive expressions for the proximal operators of the induced and the projection operator onto the norm using the above relationships.

## 3 Analysis of the mixed ℓ1,∞ norm proximal operator

The relationship given in (12) makes it clear that finding the proximal operator of a norm amounts to knowing how to project onto the unit-norm ball of the dual norm and vice-versa. In [8] the authors derived the optimality conditions for the projection onto the norm (see (15)) and proposed an algorithm for its computation based on sorting the entries of the matrix. Since these norms are duals of each other, the proximal operator for such norms can be readily computed based on (12). In contrast, we look at the proximal operator itself and derive the optimality conditions. By doing so, we arrive at a more compact expression for the optimality conditions that generalizes the well-known soft-thresholding algorithm to the matrix case. Our analysis allows for a more intuitive interpretation of the proximal operator as well as the derivation of novel algorithms for its computation.

Given a matrix , the proximal operator of the mixed norm with parameter is the solution to the following convex optimization problem:

 proxλ∥⋅∥1,∞(V)=argminX∥X∥1,∞+12λ∥X−V∥2F. (16)

Using the definition of the mixed norm in (13), we can rewrite problem (16) as the following constrained optimization problem:

 minimizeX,tt+12λ∥X−V∥2Fsubject to∥xi∥1≤t,i=1,…,m. (17)

By looking at the structure of problem (17) it is easy to derive the following result:

###### Lemma 1 (Matched Sign)

The sign of the optimal solution of (17) must match the sign of , that is

 sign(X⋆)=sign(V), (18)

where the function operates element-wise.

###### Proof:

The proof follows by contradiction. Assume is the optimal solution to problem (17) and that there are some nonzero entries of that have the opposite sign to the corresponding entries in , i.e., for some . Now, form the matrix such that . The point is feasible and causes a reduction in the objective function since while keeping the norm unchanged . This contradicts the assumption that is the optimal solution. ∎

Based on Lemma 1 the problem of finding the proximal operator in (17) boils down to finding the magnitudes of the entries of the matrix . Therefore, we can formulate it as111Notice that this is a power allocation problem which belongs to the general family of waterfilling problems [15].

 minimizeX,tt+12λ∥X−U∥2Fsubject to1Txi≤t,xi≥0,i=1,…,m, (19)

where is a matrix with non-negative entries given by . The following result determines the optimal solution of problem (19) and, as a consequence, it also determines the proximal operator of the mixed norm:

###### Proposition 1

The optimal solution of problem (19) is given by

 X⋆=[U−λ1μT]+, (20)

and

 t⋆=∑i∈M⋆1|J⋆i|∑j∈J⋆iuij−λ∑i∈M⋆1|J⋆i|, (21)

where , is the set of columns affected by thresholding, is the set of indices of the non-zero entries of , and

 μ⋆i=[∑j∈J⋆iuij−t⋆λ|J⋆i|]+, i=1,…,m. (22)

is the th entry of the vector .

###### Proof:

The Lagrangian of problem (19) is given by

 L(X,t,μ,{σi}mi=1)=t+12λm∑i=1∥xi−ui∥2+m∑i=1μi(1Txi−t)−m∑i=1σTixi . (23)

Since the problem is convex, the necessary and sufficient conditions for optimality are given by the KKT conditions:

• Zero gradient of the Lagrangian

 ∂L∂xk =1λ(xk−uk)+μk1−σk=0, ∀k (24) ∂L∂t =1−m∑i=1μi=0 (25)
• Primal and dual feasibility

 1Txk≤t,xk≥0, k=1,…,m (26) μ≥0,σk≥0, k=1,…,m (27)
• Complementary slackness

 μk(1Txk−t)=0, k=1,…,m (28) σk⊙xk=0, k=1,…,m, (29)

where denotes element-wise product.

We start by showing that equation (20) holds or equivalently, that every column of satisfies

 xk=[uk−λμk1]+, k=1,…,m. (30)

In order to do so, let be the set of columns that are affected by thresholding. Take for instance for some , then we have

 xk=uk−λμk1+λσk. (31)

In this case we can have which, by (29), (26), (27) implies . Alternatively, we can have which means . Therefore, both situations can be written in compact form as

 xk=[uk−λμk1]+, k∈M, (32)

where the thresholding operation is applied element-wise. Alternatively, take for some then from (28) it follows that and hence, . From (29) and the fact that it follows that for all . Therefore, we have that

 xk=uk, k∉M. (33)

Since for we can put together (32) and (33) into a single expression as in (30). It remains now to derive an expression that relates and . We know from (28) and the fact that for that

 (34)

where we the set denotes the non-zero entries of . Solving for in (34) leads to

 μk=∑j∈Jkukj−tλ|Jk|,k∈M. (35)

Recall that for we have and therefore, we can compactly express as

 μk=[∑j∈Jkukj−t⋆λ|Jk|]+, k=1,…,m,

and we recover the expression in (22). Finally, using equation (25) it is easy to check that

 t⋆=∑k∈M1|Jk|∑j∈Jkukj−λ∑k∈M1|Jk|,

which completes the proof. ∎

We are now ready to derive an expression for the proximal operator of the mixed norm as:

###### Corollary 1 (Proximal Operator)

The proximal operator in (16) is given by

 proxλ∥⋅∥1,∞(V)=sign(V)⊙[|V|−λ1μT]+, (36)

where is given as in Proposition 1.

###### Proof:

It follows directly from Lemma 1 and Proposition 1. ∎

The expression in Corollary 1 resembles very much the well-known soft-thresholding operator. In fact, the proximal operator of the mixed norm applies a soft-thresholding operation to every column of the matrix but with a different threshold value for each column (see Fig. 1). As expected, the above expression reduces to soft-thresholding for :

###### Corollary 2 (Soft-thresholding)

In the case so that is a vector, the proximal operator is given by the well-known soft-thresholding

 proxλ∥⋅∥1(v)=sign(v)⊙[|v|−λ1]+. (37)
###### Proof:

By setting we get from Proposition 1 that , where is the set of non-zero entries of the optimal vector . Substituting this value into (22) we get that . The result then follows from Corollary 1. ∎

###### Corollary 3 (Projection onto the ℓ∞,1 ball)

The projection onto the ball of radius is

 P∥⋅∥∞,1≤λ(V)=sign(V)⊙min(|V|,λ1μT), (38)

with given as in Proposition 1.

###### Proof:

The result follows from Corollary 1, (12) and (15). ∎

#### Remark

Note that the results presented in this section can be trivially extended to the case of complex-valued matrices by interpreting the sign operation as extracting the phase of a complex number (i.e., ).

## 4 Algorithms for computing the mixed ℓ1,∞ norm proximal operator

The results in Proposition 1 and Corollary 1 give us the basis for finding an efficient algorithm for computing the mixed norm proximal operator. However, the computation of the proximal operator directly from those expressions requires knowledge about the optimal sets and , which are not known a priori. In this section we present a procedure for addressing this issue. But first, we describe an efficient pre-processing stage that can be used to reduce the search space for the optimal sets and needed to compute the proximal operator in Proposition 1. The idea is to maximize a lower bound on the mixed norm of the optimal solution, which allows us to discard columns that will not be affected by thresholding hence, reducing the search space for the optimal sets and . This allows us to effectively reduce the dimensionality of the problem since our algorithm will be then applied to a smaller matrix (i.e., a matrix which contains a subset of columns of the original input matrix). After describing a procedure to maximize such lower bound we then propose a general iterative algorithm that uses the results in Proposition 1 and Corollary 1 to find the right solution.

### 4.1 A lower bound on the norm

It follows from the analysis presented in Section 3 that only a subset of the columns of the matrix might be affected by the proximal operator (i.e., those with norm larger than ). This fact can be exploited to reduce the search space of the problem provided that some knowledge about the value of is available. In particular, having a lower bound on would allow us to discard columns with smaller norm. It turns out that a simple lower bound can be derived from the optimality conditions as stated in the following result:

###### Lemma 2 (Lower-bound on the norm)

Let for some . Let be the mixed norm of the optimal solution. Then, for any subset

 tM=1|M|(∑i∈M∥vi∥1−nλ)≤t⋆. (39)
###### Proof:

From the optimality conditions of Problem (19) we know that , with , and that . Then, it follows that

 (40)

where the last inequality follows from the fact that . ∎

In order to reduce the search space of the problem, we can maximize the upper bound in (39) with respect to . Since the sum is maximized when we choose the columns of with the largest norm, a simple method to compute the set that maximizes is to sort the columns of according to their norm, evaluate the objective for the top columns, and choose the value of that maximizes , as described in Algorithm 1. Specifically, we form the vector that contains the norms of the columns of in decreasing order. From we compute the partial sums for , and evaluate the value of the bound (39) as described in Algorithm 1 to find its maximizer.

Note however, that while Algorithm 1 allows us to efficiently find a maximum lower bound for , depending on the parameter , the maximum lower bound might be smaller than zero, in which case it is not useful. In such case, an alternative lower bound for is given by the following result:

###### Lemma 3

Let for some . Let be the mixed norm of the optimal solution. Then, it holds that

 (41)
###### Proof:

From (25) we know that . It also holds that , hence

 mt⋆≥m∑i=11Tx⋆i=m∑i=11T[ui−λμ⋆i1]+≥m∑i=1(max1≤j≤nuij−λμ⋆i)=m∑i=1∥ui∥∞−λm∑i=1μ⋆i=∥U∥∞,1−λ=∥V∥∞,1−λ. (42)

Note that the bound in (41

) will be negative only if the optimal solution is the zero matrix since:

 ∥V∥∞,1<λ⟹P∥⋅∥∞,1≤λ(V)=V⟹proxλ∥⋅∥1,∞(V)=0. (43)

### 4.2 A general algorithm

A general procedure for computing the proximal operator of the mixed norm can be devised based on the optimality conditions of Proposition 1 and the observation that, for a fixed , the problem in (19) boilds down to projecting the columns of onto the ball of radius . A possible strategy for finding is to start with a lower bound for , project each column of whose norm is above the current lower bound onto the ball of radius , update the value of the lower bound using (21), and keep iterating until there are no further changes in (see Algorithm 2). This algorithm is guaranteed to converge to the optimal solution, as stated next.

###### Proposition 2 (Convergence)

Algorithm 2 converges to the proximal operator of the mixed norm of matrix in a finite number of iterations.

###### Proof:

Observe that the algorithm produces a monotonic sequence of values for that eventually converges to the optimal value . To see this, note that for a given , the projection onto the ball has the form of (30) that is, for some value . Let denote the value at the optimal solution. Now since is a lower bound on then it is necessary the case that (hence ). Let denote the resulting sets after projecting onto the ball of radius . The new value is then given by

 t+=∑i∈M1|Ji|∑j∈Jiuij−λ∑i∈M1|Ji|=∑i∈M1|Ji|∑j∈Ji(uij−μiλ+μiλ)−λ∑i∈M1|Ji|=∑i∈M1|Ji|t+μiλ−λ∑i∈M1|Ji|=t+λ∑i∈Mμi−1∑i∈M1|Ji|≥t, (44)

where the last inequality follows from the fact that and (see (25)). In fact, the inequality is strict and it is satisfied with equality only when since in that case the optimality conditions of Proposition 1 are satisfied. Note that a change in can only happen if there is a change in the sets or . Since the number of possible sets is finite and due to the monotonicity of the values of the algorithm terminates in a finite number of iterations. ∎

## 5 Numerical experiments

### 5.1 Complexity and implementation

The complexity of Algorithm 2 depends on the method used for computing the projection step onto the simplex and there exist different alternatives in the literature [16]. A naive implementation of the proposed algorithm can lead to a computationally inefficient method if at every iteration the projection step is computed from scratch. Alternatively, one could exploit previous estimates from one iteration to the next in order to improve the computational efficiency. In this paper we propose two different approaches for computing the projection step onto the simplex: the first one is based on sorting the columns of the matrix of absolute values that are affected by thresholding. In such case, the expected complexity of the method is dominated by the sorting operation and it is of operations. The second approach is based on an active set method based on Proposition 1. In fact, the projection onto the simplex part is equivalent to the one in [17]. While in the latter case the complexity analysis is not straightforward, we have experimentally observed it to be more efficient than the sorting-based one in most of the tested scenarios.

### 5.2 Numerical validation

In order to evaluate the computational complexity of the proposed algorithms we randomly generate matrices in

with independent and identically distributed random entries drawn from a uniform distribution

. We then apply the proposed implementations of Algorithm 2 and label them “Sort” for the sorting-based implementation and “Active Set” for the one based on active sets to compute projections onto the mixed ball for different radius values. We also compute the projections using the state of the art algorithms. In particular we compare to the method proposed in [8] which we denote as “QT” and with the recently proposed root-search based methods of [13, 18] which we denote as “ST” (Steffensen) and “NT” (Newton), respectively. We record the execution time for different configurations (sizes) of the data matrix and for different values of the ball radius. In our experiments, we choose the radius of the ball to be a fraction of the true mixed norm of the matrix and compute the average computation time over realizations. For the methods in [8, 13, 18] we use the implementations provided by the authors. The results for different matrix sizes are displayed in Table LABEL:tab:timing. As it can be observed from the table, our two implementations achieve the best performance offering an improvement over the state of the art that ranges between one and two orders of magnitude.

### 5.3 Application to cancer classification from gene expression data

In this section we test our algorithms in the context of multi-task learning for the problem of cancer classification from gene expression data where the dimensionality of the feature vectors is typically much larger than the number of samples . We use the datasets provided in [19] which consist of five curated datasets of different types of cancers as described in [19]. The datasets are briefly summarized in Table LABEL:tab:datasets. We pose the classification problem as a multi-task learning problem. In particular, given a dataset of points with associated labels , with and , where is the number of classes, we build a data matrix and target label matrix with

 (45)

The problem is to predict the correct label for each class while enforcing feature sharing among them:

 minimizeW∥Y−XWT∥2Fsubject to∥W∥∞,1≤τ. (46)

Note that problem (46) falls within the family of problems in (2) which can be solved using a projected gradient descent strategy. For the projection step onto the ball we use the sorting-based implementation of Algorithm 2.

We conducted an experiment using the datasets of Table LABEL:tab:datasets

where we center the data points (mean subtraction) and normalize them by dividing each coordinate by its standard deviation. For each dataset we split the data into

training and testing and computed the average classification performance over random data splits. Once we solve (46) we use the following simple classification rule:

 ^ci=argmax1≤j≤n^yij,^Y=XWT=[^y1,…,^yp]T. (47)

In addition, we use the learned weights to identify relevant features and train a (kernel) support vector machine (SVM) classifier on the identified features. Features are sorted according to the Euclidean norm of the columns of

being the most relevant index the one with larger norm. For the multi-class problem we use a one-versus-one strategy with majority voting. We also provide a comparison with the

norm based feature selection method of

[19] for which we used the implementation provided by the authors. The ball radius in (46) as well as the regularization parameter for the method in [19] were chosen using a grid search. The average classification accuracy of both methods the classification rule (47) are summarized in Table LABEL:tab:accuracy. As it can be appreciated we observe that the proposed method using the norm achieves better classification accuracy than the method based on the proposed in [19]. It is important to note that the differences are more pronounced in multi-class problems than in binary ones indicating as expected, that the norm encourages the discovery of variables that are most correlated.

We also report the classification results using an SVM classifier and for different number of features used. The results are displayed in Fig. 2 for all datasets. We can observe the superior performance of the proposed scheme in selecting relevant features for the discrimination task. Again the performance gap is generally more pronounced on those datasets with more than two classes.

## 6 Conclusions

In this paper we have analyzed in detail the proximity operator of the mixed matrix norm. We have provided simple expressions for its computation that generalize the well-known soft-thresholding algorithm. By exploiting the duality relationship to the norm we also derive the projection operator onto the mixed norm. In addition, we have proposed a general algorithm for the computation of the proximal operator and two particular implementations that can be orders of magnitude faster than the state of the art making them particularly suitable for large-scale problems. We have also illustrated the application of the norm for biomarker discovery (feature selection) for the problem of cancer classification from gene expression data.

## Acknowledgment

This research was supported in part by the Northrop Grumman Mission Systems Research in Applications for Learning Machines (REALM) initiative.

## References

• [1] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed.   New York, NY, USA: Springer, 2006.
• [2] S. Boyd and L. Vandenberghe, Convex Optimization.   New York, NY, USA: Cambridge University Press, 2004.
• [3] P. L. Combettes and J.-C. Pesquet, “Proximal Splitting Methods in Signal Processing,” in Fixed-Point Algorithms for Inverse Problems in Science and Engineering, H. Bauschke, R. Burachik, P. Combettes, V. Elser, D. Luke, and H. Wolkowicz, Eds.   Springer, 2011, pp. 185–212. [Online]. Available: https://hal.inria.fr/hal-00643807
• [4] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, “Optimization with sparsity-inducing penalties,” Foundations and Trends® in Machine Learning, vol. 4, no. 1, pp. 1–106, 2012. [Online]. Available: http://dx.doi.org/10.1561/2200000015
• [5] N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends® in Optimization, vol. 1, no. 3, pp. 127–239, 2014. [Online]. Available: http://dx.doi.org/10.1561/2400000003
• [6] G. Obozinski, B. Taskar, and M. I. Jordan, “Joint covariate selection and joint subspace selection for multiple classification problems,” Statistics and Computing, vol. 20, no. 2, pp. 231–252, 2010. [Online]. Available: http://dx.doi.org/10.1007/s11222-008-9111-x
• [7] B. A. Turlach, W. N. Venables, and S. J. Wright, “Simultaneous variable selection,” Technometrics, vol. 47, no. 3, pp. 349–363, Aug. 2005.
• [8] A. Quattoni, X. Carreras, M. Collins, and T. Darrell, “An Efficient Projection for Regularization,” in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML ’09.   New York, NY, USA: ACM, 2009, pp. 857–864. [Online]. Available: http://doi.acm.org/10.1145/1553374.1553484
• [9] E. Elhamifar, G. Sapiro, and R. Vidal, “See all by looking at a few: Sparse modeling for finding representative objects,” in

IEEE Conference on Computer Vision and Pattern Recognition

, 2012.
• [10] I. Dokmanić and R. Gribonval, “Beyond moore-penrose part I: generalized inverses that minimize matrix norms,” CoRR, vol. abs/1706.08349, 2017. [Online]. Available: http://arxiv.org/abs/1706.08349
• [11] I. Dokmanić and R. Gribonval, “Concentration of the frobenius norm of generalized matrix inverses,” SIAM Journal on Matrix Analysis and Applications, vol. 40, no. 1, pp. 92–121, 2019.
• [12] S. Sra, “Fast projections onto -norm balls for grouped feature selection,” in Machine Learning and Knowledge Discovery in Databases.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 305–317.
• [13] G. Chau, B. Wohlberg, and P. Rodríguez, “Fast projection onto the mixed norm ball using steffensen root search,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, Alberta, Canada, Apr. 2018, pp. 4694–4698.
• [14] J. J. Moreau, “Proximité et dualtité dans un espace Hilbertien,” Bulletin de la Societé Mathématique de France, vol. 93, pp. 273–299, 1965.
• [15] D. P. Palomar and J. R. Fonollosa, “Practical algorithms for a family of waterfilling solutions,” IEEE Transactions on Signal Processing, vol. 53, no. 2, pp. 686–695, Feb 2005.
• [16] L. Condat, “Fast projection onto the simplex and the ball,” Mathematical Programming, vol. 158, no. 1, pp. 575–585, 2016. [Online]. Available: http://dx.doi.org/10.1007/s10107-015-0946-6
• [17] C. Michelot, “A Finite Algorithm for Finding the Projection of a Point onto the Canonical Simplex of ,” Journal of Optimization Theory and Applications, vol. 50, no. 1, pp. 195–200, 1986.
• [18] G. Chau, B. Wohlberg, and P. Rodríguez, “Efficient projection onto the mixed-norm ball using a newton root search method,” SIAM Journal on Imaging Sciences, 2019, accepted for publication.
• [19] F. Nie, H. Huang, X. Cai, and C. H. Ding, “Efficient and Robust Feature Selection via Joint -Norms Minimization,” in Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’10.   USA: Curran Associates, Inc., 2010, pp. 1813–1821.
• [20] A. I. Su, J. B. Welsh, L. M. Sapinoso, S. G. Kern, P. Dimitrov, H. Lapp, P. G. Schultz, S. M. Powell, C. A. Moskaluk, H. F. Frierson, and G. M. Hampton, “Molecular classification of human carcinomas by use of gene expression signatures,” Cancer Research, vol. 61, no. 20, pp. 7388–7393, 2001. [Online]. Available: http://cancerres.aacrjournals.org/content/61/20/7388
• [21] K. Yang, Z. Cai, J. Li, and G. Lin, “A stable gene selection in microarray data analysis.” BMC Bioinformatics, vol. 7, pp. 228 – 243, 2006. [Online]. Available: http://search.ebscohost.com.proxy1.library.jhu.edu/login.aspx?direct=true&db=asn&AN=43698712&site=ehost-live&scope=site
• [22] C. L. Nutt, D. R. Mani, R. A. Betensky, P. Tamayo, J. G. Cairncross, C. Ladd, U. Pohl, C. Hartmann, M. E. McLaughlin, T. T. Batchelor, P. M. Black, A. von Deimling, S. L. Pomeroy, T. R. Golub, and D. N. Louis, “Gene expression-based classification of malignant gliomas correlates better with survival than histological classification,” Cancer Research, vol. 63, no. 7, pp. 1602–1607, 2003. [Online]. Available: http://cancerres.aacrjournals.org/content/63/7/1602
• [23] A. Bhattacharjee, W. G. Richards, J. Staunton, C. Li, S. Monti, P. Vasa, C. Ladd, J. Beheshti, R. Bueno, M. Gillette, M. Loda, G. Weber, E. J. Mark, E. S. Lander, W. Wong, B. E. Johnson, T. R. Golub, D. J. Sugarbaker, and M. Meyerson, “Classification of human lung carcinomas by mrna expression profiling reveals distinct adenocarcinoma subclasses,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 24, pp. 13 790–13 795, 2001. [Online]. Available: http://www.jstor.org/stable/3057173
• [24] S. P. Fodor, “Massively parallel genomics.” Science, vol. 277, no. 5324, pp. 393 – 395, 1997.
• [25] D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D’Amico, J. P. Richie, E. S. Lander, M. Loda, P. W. Kantoff, T. R. Golub, and W. R. Sellers, “Gene expression correlates of clinical prostate cancer behavior,” Cancer Cell, vol. 1, no. 2, pp. 203 – 209, 2002. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1535610802000302