1 Introduction
The cardinality constraint is an intrinsic way to restrict the solution structure in many real problems, for example, sparse learning (Olshausen and Field, 1997), feature selection (Zhang, 2009), and compressed sensing (Candes et al., 2006). The generic cardinality constrained optimization problem can be expressed as

(1a)  $\min_{x} \; f(x)$
(1b)  subject to $\|x_g\|_0 \le s_g, \quad \forall g \in \mathcal{G}$,

where $x \in \mathbb{R}^p$ is the optimization variable, $g$ is an index subset of $[p] := \{1, 2, \dots, p\}$, and $x_g$ is the subvector of $x$ indexed by $g$. $\|x_g\|_0$ denotes the cardinality of the subvector, i.e., the number of nonzeros in $x_g$; $\mathcal{G}$ is the hyper set of all predefined groups; and $s$ is the upper bound vector whose entry $s_g$ refers to the upper bound of the sparsity over group $g$. The objective $f$ is the loss function, which can take different forms depending on the specific application. Problem (1) is a nonconvex (NP-hard) optimization problem due to the cardinality constraint. Some efficient iterative methods such as IHT (Yuan et al., 2014), CoSaMP (Needell and Tropp, 2009), GradMP (Nguyen et al., 2012), and their variants can provably solve the original problem under some mild conditions. A key component in all of these methods is the projection operator

(2)  $P_\Omega(v) = \arg\min_{x \in \Omega} \|x - v\|^2$,

where $\Omega$ denotes the feasible set defined by the constraint (1b). While in some special cases, for example, when the groups do not overlap, the projection is trivial, it is quite challenging in general, especially when $\mathcal{G}$ includes multiple overlapping index sets (it is even NP-hard in some cases).
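In the simplest case of a single overall sparsity constraint, the projection is just hard thresholding: keep the largest-magnitude entries and zero out the rest. A minimal Python sketch (the function name is our own illustration, not from the paper):

```python
import numpy as np

def project_overall_sparsity(v, s):
    """Project v onto {x : ||x||_0 <= s} by keeping the s largest-magnitude entries."""
    x = np.zeros_like(v, dtype=float)
    if s <= 0:
        return x
    keep = np.argsort(np.abs(v))[-s:]  # indices of the s largest |v_i|
    x[keep] = v[keep]
    return x
```

With overlapping groups this simple rule no longer applies, which is the difficulty the paper addresses.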
In this paper, we consider the scenario where the overlapped cardinality constraints (1b) satisfy a Three-view Cardinality Structure (TVCS):
Definition 1.
(Three-view Cardinality Structure (TVCS)) For $x \in \mathbb{R}^p$, the hyper set $\mathcal{G}$ consisting of subsets of $[p]$ admits the TVCS structure if the following conditions are satisfied:

There exists a partition $\mathcal{G}_0$, $\mathcal{G}_1$, and $\mathcal{G}_2$ such that $\mathcal{G} = \mathcal{G}_0 \cup \mathcal{G}_1 \cup \mathcal{G}_2$;

$\mathcal{G}_0 = \{[p]\}$;

All element sets in $\mathcal{G}_1$ have no overlap;

All element sets in $\mathcal{G}_2$ have no overlap.
This definition basically requires that $\mathcal{G}$ can be partitioned into three hyper sets $\mathcal{G}_0$, $\mathcal{G}_1$, and $\mathcal{G}_2$, and overlaps can only happen between element sets in different hyper sets. $\mathcal{G}_0$ is usually used to restrict the overall sparsity. Figure 1 provides two examples of $\mathcal{G}$ for TVCS.
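As a concrete illustration of the definition (a toy construction of our own, not from the paper), the three views for an m x n matrix variable can be enumerated and the no-overlap conditions checked in Python:

```python
import numpy as np

def tvcs_groups(m, n):
    """TVCS groups for an m x n matrix variable flattened row-major:
    G0 = all coordinates, G1 = row groups, G2 = column groups."""
    idx = np.arange(m * n).reshape(m, n)
    G0 = [set(idx.ravel())]
    G1 = [set(idx[i, :]) for i in range(m)]   # rows: pairwise disjoint
    G2 = [set(idx[:, j]) for j in range(n)]   # columns: pairwise disjoint
    return G0, G1, G2

def pairwise_disjoint(groups):
    """Check the no-overlap condition within one view."""
    seen = set()
    for g in groups:
        if seen & g:
            return False
        seen |= g
    return True
```

Note that every row group intersects every column group in exactly one coordinate, so overlaps occur only across views, as the definition allows.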
The TVCS model is motivated by several important applications, for example, in recommendation, task-worker assignment, and bioinformatics.

Online recommendation. Suppose we want to recommend a certain number of books (among $p$ books) to a customer, corresponding to the $\mathcal{G}_0$-based sparsity constraint. Among the selected books, we want to maintain some diversity: the recommended books by the same author should not exceed a certain number (the $\mathcal{G}_1$-based sparsity constraint), and those about the same topic should not exceed a certain number either (the $\mathcal{G}_2$-based sparsity constraint). One can refer to the top graph in Figure 1: $\mathcal{G}_1$ is grouped by authors and $\mathcal{G}_2$ is grouped by topics.

Task-worker assignment. Suppose we have a bunch of tasks and workers, and we want to assign the tasks to workers. For example, in crowdsourcing, we usually assign several different workers to each task since we want to use the answers from multiple workers to improve the accuracy. On the other hand, each worker is usually assigned multiple tasks, so there is a "many-to-many" relationship in this assignment. The goal is to pursue the optimal assignment under a certain criterion in crowdsourcing, while satisfying some restrictions. For example, the total number of assignments should be bounded by the total budget (corresponding to $\mathcal{G}_0$), the total cost of assignments to a single worker cannot exceed a certain threshold (corresponding to $\mathcal{G}_1$), and the total cost of assignments on a single task cannot exceed a certain threshold (corresponding to $\mathcal{G}_2$). Let $X$ be the assignment matrix, whose rows are indexed by workers and whose columns are indexed by tasks. These constraints can be illustrated by the bottom graph in Figure 1.

Identification of gene regulatory networks. The essential goal of identifying a gene regulatory network is to identify a weighted directed graph, which can be represented by a square matrix with $n^2$ elements in total, where $n$ is the number of vertices. A sparse network constraint restricts the in-degree and out-degree of each vertex, which corresponds to the sparsity in each row and column of the matrix.
To solve the TVCS constrained projection (2), we show an interesting connection between the projection and a linear program (LP): the vertex solution to this linear program is an integer solution which solves the original problem.
To find an integer solution to such an LP efficiently, we formulate it as a feasibility problem, and further as an equivalent quadratic convex optimization. By using a rounding technique, we can avoid computing the exact solution of this LP. We propose an iterative algorithm to solve it, in which each iteration can be completed in linear time. We also show that the iterates converge linearly to the optimal point. Finally, the proposed TVCS model is validated by synthetic experiments and two important and novel applications: identification of gene regulatory networks and the task assignment problem in crowdsourcing.
2 Related Works
Recent years have witnessed much research in the field of structured sparsity and group-based sparsity. Yuan and Lin (2006) introduced the group LASSO, which pursues group-wise sparsity that restricts the number of groups for the selected variables. Jenatton et al. (2011) constructed a hierarchical structure over the variables and used group LASSO with overlapped groups to solve it. Exclusive LASSO (Zhou et al., 2010; Kong et al., 2014) was proposed for exclusive group sparsity, which can be treated as relaxing our cardinality constraints to convex regularizations. In (Kong et al., 2014), the authors discussed the overlapping situation and tried to solve the problem using convex relaxation, which is different from our approach. Besides the aforementioned works, some proposed more general models to cover various sparsity structures. Bach et al. (2012) extended the usage of convex norm relaxations to several different categories of structures. Recently, another generalization work (El Halabi and Cevher, 2015) proposed convex envelopes for various sparsity structures. They built the framework by defining a totally unimodular penalty, and showed how to formulate different sparsity structures using the penalty. The work above concentrated on using convex relaxation to control the sparsity.
Besides using convex relaxation, there are also some works focusing on projection-based methods. When an exact projection operator is provided, Baraniuk et al. (2010) extended the traditional IHT and CoSaMP methods to general sparsity structures. In this work, the authors also introduced the projection operator for block sparsity and tree sparsity. Cevher et al. (2009) investigated cluster sparsity and applied dynamic programming to solve the projection operator for their sparsity model. Hegde et al. (2009) introduced a "spike trains" signal model, which is also related to exclusive group sparsity. Its groups always have consecutive coordinates, and each group cannot contain more than one nonzero element. To solve the projection problem of their model, they showed that the basic feasible solutions of the relaxed linear program (LP) are always integer points. In our work, we also use an LP to solve the projection problem, but our model defines the group structure differently and aims at different applications.
In addition, there are some works for the cases without an efficient exact projection operator (Hegde et al., 2015a, b; Nguyen et al., 2014). This is meaningful since the projection operator for complex structured sparsity often involves solving complicated combinatorial optimization problems. Hegde et al. (2015a) discussed how to guarantee convergence when using approximate projections in IHT and CoSaMP for compressive sensing. They proved that convergence requires a "head approximation" to project the update (gradient) before applying it. Hegde et al. (2015b) proposed a general framework that formulates a series of models as a weighted graph, and designed an efficient approximate projection operator for these models. Nguyen et al. (2014) applied approximate-projection-based IHT and CoSaMP to general convex functions and stochastic gradients.
3 Preliminary: GradMP and IHT Frameworks
This section briefly reviews two commonly used algorithmic frameworks for solving the cardinality constrained optimization (1): iterative hard thresholding (IHT) (Yuan et al., 2014; Nguyen et al., 2014) and gradient matching pursuit (GradMP) (Nguyen et al., 2012, 2014), the general version of CoSaMP (Needell and Tropp, 2009). Other methods like hard thresholding pursuit (HTP) also follow similar steps and have been shown to be effective both empirically and theoretically (Yuan et al., 2016). The procedures of IHT and GradMP for our model are shown in Algorithms LABEL:alg:stoiht and LABEL:alg:gradmp, where $\mathrm{supp}(\cdot)$ is the support set of the argument vector.
Therefore, one can see that the efficiency of both algorithms relies on the computation of the gradient and the projection. To avoid the expensive computation of the gradient, GradMP and IHT can be extended to stochastic versions (Nguyen et al., 2014) by substituting a stochastic gradient at the gradient computation step.
Both Algorithms LABEL:alg:stoiht and LABEL:alg:gradmp (and their stochastic variants) guarantee some nice properties: the iterate converges to a small ball surrounding the true solution at a linear rate under certain RIPtype conditions (Nguyen et al., 2014) and the radius of such ball converges to zero when the number of samples goes to infinity.
A common component in Algorithms LABEL:alg:stoiht and LABEL:alg:gradmp is the projection operator. If all the groups in $\mathcal{G}$ except the overall-sparsity group do not overlap each other, the projection problem can be easily solved by sequential projections (Yang et al., 2016). But for cases involving overlapped groups, it is generally challenging to solve the projection efficiently.
4 Projection Operator
This section introduces how to solve the essential projection step. Note that projection onto a nonconvex set is NP-hard in general. By utilizing the special structure of TVCS, we show that the projection can be solved efficiently. Due to the page limitation, all proofs are provided in the supplementary material.
4.1 LP Relaxation
First, we cast the projection problem (2) to an equivalent integer linear programming (ILP) problem according to Lemma 1.
Lemma 1.
The projection problem (2) is equivalent to the following integer linear programming (ILP) problem:

(3)  $\max_{x} \; \langle v^{\odot 2}, x \rangle$
subject to  $Ax \le s, \quad x \in \{0, 1\}^p$,

where $v^{\odot 2}$ denotes applying the element-wise square operation to the vector $v$ being projected, and $A$ is a matrix defined as

(4)  $A = [\mathbf{1}^\top;\; A^{(1)};\; A^{(2)}]$,

where $A^{(1)}$ and $A^{(2)}$ are the indicator matrices of $\mathcal{G}_1$ and $\mathcal{G}_2$, whose rows represent the indicator vectors of the corresponding groups, and $\mathbf{1}$ is the all-ones vector.
Each row in $A$ corresponds to one group from $\mathcal{G}$. For example, $A^{(1)}_{ij} = 1$ if the $j$th coordinate is in the $i$th group of $\mathcal{G}_1$, otherwise $A^{(1)}_{ij} = 0$. The first row $\mathbf{1}^\top$ corresponds to the overall sparsity, i.e., $\mathcal{G}_0$.
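For illustration, the constraint matrix can be assembled from indicator rows; the helper below is our own sketch (group lists are assumed to be given as collections of coordinate indices):

```python
import numpy as np

def build_constraint_matrix(p, groups1, groups2):
    """Stack A = [1^T; A1; A2]: first the overall-sparsity row of ones,
    then one 0/1 indicator row per group in G1 and in G2."""
    rows = [np.ones(p)]
    for g in list(groups1) + list(groups2):
        r = np.zeros(p)
        r[list(g)] = 1.0
        rows.append(r)
    return np.vstack(rows)
```

For row/column groups of a matrix variable, each coordinate belongs to the overall group, one row group, and one column group, so every column of the resulting matrix has exactly 3 nonzeros.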
It is NP-hard to solve an ILP in general. One common way to handle such an ILP is to make a linear programming (LP) relaxation. In our case, we can use a box constraint to replace the integer constraint $x \in \{0, 1\}^p$:

(5)  $\max_{x} \; \langle v^{\odot 2}, x \rangle$
subject to  $Ax \le s, \quad 0 \le x \le 1$.

However, there is no guarantee that a general ILP can be solved via its LP relaxation, because the solution of the relaxed LP is not always integral. Although one can round the LP solution to obtain an integer solution, such a solution is not guaranteed to be optimal (or even feasible) for the original ILP.
Fortunately, due to the special structure of our TVCS model, we find that its relaxed LP has some nice properties which make it possible to get the optimal solution of the ILP efficiently. The following theorem reveals the relationship between the ILP problem and the relaxed LP problem.
Theorem 2.
All vertices of the feasible polytope of the relaxed LP (5) are integer points, and hence any vertex solution of (5) is an optimal solution of the ILP (3).
This theorem suggests that finding a vertex solution of the relaxed LP solves the original projection problem onto a TVCS set $\Omega$. The proof basically shows that the matrix $A$ for TVCS is a totally unimodular matrix (Papadimitriou and Steiglitz, 1982). We provide the detailed proof in the supplementary material.
4.2 Linearly Convergent Algorithm for Projection Operator onto TVCS
To find a vertex solution, one can use the Simplex method. Although the Simplex method is guaranteed to find an optimal solution at a vertex and can be very efficient in practice, it does not have a deterministic complexity bound. In the IHT and GradMP algorithms, the projection operator is only a subprocedure within one iteration; hence, we usually need to solve many instances of problem (3). Simplex may be efficient in practice, but its worst case can lead to exponential time complexity (Papadimitriou and Steiglitz, 1982). In this section, we show that an integer solution to the linear program can be found with complexity proportional to the number of variables and constraints.
Equivalent Feasibility Problem Formulation.
The dual of the LP problem (5) can be written as:

(6)  $\min_{y, z} \; \langle s, y \rangle + \langle \mathbf{1}, z \rangle$
subject to  $A^\top y + z \ge v^{\odot 2}, \quad y \ge 0, \quad z \ge 0$.

By LP strong duality, solving the primal (5) and dual (6) together is equivalent to finding a point that satisfies the constraints of both problems and makes the two objective values equal, i.e., a feasibility problem with linear constraints.
Iterative Algorithm.
The feasibility problem with linear constraints above is equivalent to the following optimization problem:

(7)  $\min_{u} \; \|[Bu + b]_+\|^2$
subject to  $l \le u \le h$,

where $u$ collects all primal and dual variables $(x, y, z)$, the system $Bu + b \le 0$ stacks the linear constraints of (5) and (6) together with the equality of the two objectives, the box $[l, h]$ encodes the bound constraints, and $[\cdot]_+$ is the element-wise hinge operator, i.e., it transforms each element $t$ into $\max(t, 0)$.
This is a convex optimization problem with a quadratic objective and box constraints. We adopt the projected gradient descent to solve this problem, and show it converges linearly.
Theorem 3.
For an optimization problem of the form

$\min_{u} \; f(u) := \|[Bu + b]_+\|^2$
subject to  $l \le u \le h$,

where $B \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, the projected gradient descent algorithm $u_{t+1} = P_{[l, h]}\big(u_t - \gamma \nabla f(u_t)\big)$ has a linear convergence rate for some $\tau \in (0, 1)$ (depending on $B$ and the step size $\gamma$):

$\|u_{t+1} - P_{S^*}(u_{t+1})\| \le \tau \, \|u_t - P_{S^*}(u_t)\|,$

where $P_{S^*}$ is the projection onto the optimal solution set $S^*$.
Notice that the objective function in Theorem 3 is not necessarily strongly convex, which means the well-recognized linear convergence conclusion based on strong convexity is not applicable here.
The proof of Theorem 3 mainly applies Hoffman's Theorem (Hoffman, 2003) to show that the objective $\|[Bu + b]_+\|^2$ satisfies the optimal strong convexity condition (Liu and Wright, 2015). This leads to a linear convergence rate.
The convergence rate constant $\tau$ depends on the Hoffman constant (Hoffman, 2003), which is determined by $B$ and is always positive, and on the Lipschitz constant of the gradient of the objective. More details are included in the supplementary materials.
To analyze the complexity of this algorithm, we first count how many iterations are needed. Since the optimal point is an integer vertex, we can simply round^{1} the iterate once it is sufficiently close to the optimum.

^{1} Acute readers may notice that in some cases the convergent point may lie on a face of the polytope instead of at a vertex. However, we can add a small random perturbation to ensure that the optimal point is a vertex with probability 1.

Let $u$ represent all the variables in (7). Because the optimal point $u^*$ is integral, we can round safely once $\|u_t - u^*\| < 1/2$. According to Theorem 3, we have the linear convergence rate $\tau$, so the number of iterations we need is $O\big(\log(\|u_0 - u^*\|) / \log(1/\tau)\big)$. Therefore, we claim that we can obtain the solution by rounding after that many iterations.
Second, we show that the computational complexity of each iteration is linear in the dimensionality $p$ and the number of groups $|\mathcal{G}|$. Since each column of $A$ contains at most 3 nonzero elements, the complexity of the matrix multiplications in computing the gradient of (7) is $O(p + |\mathcal{G}|)$. Together with the other computations, the complexity of each iteration is $O(p + |\mathcal{G}|)$.
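The projection routine of this section can be sketched as projected gradient descent on a hinge-squared objective with box constraints. In the sketch below, the matrix B, offset b, and box bounds are placeholders for the stacked primal-dual system, not the paper's exact construction:

```python
import numpy as np

def projected_gradient(B, b, lo, hi, step=None, iters=500):
    """Minimize f(u) = ||max(Bu + b, 0)||^2 over the box lo <= u <= hi
    by projected gradient descent."""
    m, n = B.shape
    if step is None:
        # 1/L with L = 2 ||B||^2, an upper bound on the gradient's Lipschitz constant
        step = 1.0 / (2.0 * np.linalg.norm(B, 2) ** 2)
    u = np.clip(np.zeros(n), lo, hi)
    for _ in range(iters):
        r = np.maximum(B @ u + b, 0.0)        # hinge residuals
        grad = 2.0 * B.T @ r                  # gradient of ||r||^2
        u = np.clip(u - step * grad, lo, hi)  # project back onto the box
    return u

def f_val(B, b, u):
    return float(np.sum(np.maximum(B @ u + b, 0.0) ** 2))
```

Each iteration costs one multiplication by B and one by its transpose, which is linear in the number of nonzeros of B, matching the per-iteration complexity claim above when B is as sparse as the TVCS constraint matrix.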
5 Empirical Study
This section validates the proposed method on synthetic data and in two practical applications: crowdsourcing and identification of gene regulatory networks.
5.1 Linear Regression and Classification on Synthetic Data
In this section, we validate the proposed method with a linear regression objective and a squared hinge objective (classification) on synthetic data. Let $X \in \mathbb{R}^{m \times n}$ be a matrix variable, and let $\mathcal{G}_1$ and $\mathcal{G}_2$ be defined as the groups consisting of all rows and all columns, respectively. The linear regression loss is defined as $f(X) = \sum_{i=1}^{N} (\langle A_i, X \rangle - y_i)^2$ and the squared hinge loss is defined as $f(X) = \sum_{i=1}^{N} \max(0, 1 - y_i \langle A_i, X \rangle)^2$, where $N$ is the total number of training samples, and $A_i$ and $y_i$ are the features and label of the $i$th sample, respectively.

In the linear regression experiment, the true model $\bar{X}$ is generated from the following procedure: generate a random vector and apply the projection operator to get a support set which satisfies our sparsity constraints; the elements at positions in the support set are drawn from the standard normal distribution. The matrix size is fixed and the number of samples $N$ is gradually increased. The group sparsity upper bounds for $\mathcal{G}_1$ and $\mathcal{G}_2$ are uniformly generated from integers in a fixed range, and the overall sparsity upper bound is set accordingly. Each $A_i$ is an i.i.d. Gaussian random matrix, and $y_i$ is generated from $y_i = \langle A_i, \bar{X} \rangle + \epsilon_i$, where $\epsilon_i$ is i.i.d. Gaussian random noise.

We compare the proposed method to bi-level exclusive sparsity with non-overlapped groups (row-wise or column-wise) (Yang et al., 2016), overall sparsity (Needell and Tropp, 2009), and exclusive LASSO (Kong et al., 2014). For fairness, we project the final result of all the compared methods so that it satisfies all constraints. All the experiments are repeated 30 times and we report the averaged result. We use selection recall and successful recovery rate to evaluate the performance. Selection recall is defined as $|\mathrm{supp}(X^*) \cap \mathrm{supp}(\bar{X})| / |\mathrm{supp}(\bar{X})|$, where $X^*$ is the optimization result. Successful recovery rate is the ratio of the number of successful support recoveries, i.e., $\mathrm{supp}(X^*) = \mathrm{supp}(\bar{X})$, to the total number of repeated experiments. In Figure 2 we can observe that our model with all sparsity constraints always has the best performance. While the performance of exclusive LASSO and our method is comparable when the number of samples is very limited, our method outperforms exclusive LASSO as the number of samples increases.

For the classification experiments, we use the same sparsity settings as in linear regression, fix the matrix size, and vary the number of samples $N$. The true model and feature matrices are generated in the same way as in the linear regression experiment, and the class label is given by $y_i = \mathrm{sign}(\langle A_i, \bar{X} \rangle)$. Besides the selection recall, we also compare the classification error. In Figure 3, we can see that the superiority of our method is even more significant in the classification experiment. Although the overall sparsity baseline has the lowest selection recall, it still has a classification error similar to the methods that consider row or column groups.
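The two evaluation measures can be computed as follows (a sketch; `recovery_rate` reflects our reading that a trial succeeds when the support is recovered exactly):

```python
import numpy as np

def selection_recall(x_hat, x_true):
    """|supp(x_hat) intersect supp(x_true)| / |supp(x_true)|."""
    s_hat = set(np.flatnonzero(x_hat))
    s_true = set(np.flatnonzero(x_true))
    return len(s_hat & s_true) / len(s_true)

def recovery_rate(recalls):
    """Fraction of repeated trials whose support was fully recovered (recall == 1)."""
    return float(np.mean([r == 1.0 for r in recalls]))
```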
5.2 Application in Crowdsourcing
This section applies the proposed method to the worker-task assignment problem in crowdsourcing. Take the image labeling task as an example. Given $m$ workers and $n$ images, each image can be assigned to multiple workers and each worker can label multiple images. The predicted label for each image is decided by all the labels provided by the assigned workers and the quality of each worker on that image. The goal is to maximize the expected prediction accuracy based on the assignment. Let $X \in \{0, 1\}^{m \times n}$ be the assignment matrix, i.e., $X_{ij} = 1$ if the $i$th worker is assigned to the $j$th task, and $X_{ij} = 0$ otherwise.
$Q \in [0, 1]^{m \times n}$ is the corresponding quality matrix, which is usually estimated from a golden standard test (Ho et al., 2013). The whole formulation maximizes the average expected prediction accuracy over tasks subject to a TVCS constraint:

(8)  $\max_{X} \; \frac{1}{n} \sum_{j=1}^{n} a(X_{\cdot j}, Q_{\cdot j})$
subject to  $\|X\|_0 \le s_{\mathrm{total}}, \quad \|X_{\cdot j}\|_0 \le s_{\mathrm{worker}}, \quad \|X_{i \cdot}\|_0 \le s_{\mathrm{task}}, \quad X \in \{0, 1\}^{m \times n}$,

where $a(\cdot)$ is the expected prediction accuracy, $s_{\mathrm{worker}}$ is the "worker sparsity", i.e., the largest number of assigned workers for each task, $s_{\mathrm{task}}$ is the "task sparsity", i.e., each worker can be assigned at most $s_{\mathrm{task}}$ tasks, and $s_{\mathrm{total}}$ is the total sparsity to control the budget, i.e., the maximal number of assignments. In the image labeling task, we assume that each image can only have two possible classes and that the percentage of images in each class is one half. We use the Bayes rule to infer the predicted labels given the workers' answers. Here we consider the binary classification task. Let $y_j \in \{-1, +1\}$ be the true label of the $j$th task and let $\hat{y}_j$ be the prediction given the labels by the selected workers, i.e., the label with the larger posterior probability given $\{z_{ij}\}_{i \in W_j}$, where $z_{ij}$ is the $i$th worker's prediction on the $j$th task and the set $W_j$ contains the indices of the selected workers for the $j$th task, i.e., $W_j = \{i : X_{ij} = 1\}$. Then $a(\cdot)$ is defined as the probability that this prediction is correct, $a(X_{\cdot j}, Q_{\cdot j}) = \mathbb{P}(\hat{y}_j = y_j)$.
Defined this way, the expected accuracy is not continuous, so we use a smooth function to approximate it and adopt the stochastic gradient with the proposed projection operator to optimize it. Due to the space limitation, the detailed derivation of the objective formulation can be found in the supplementary material.
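For intuition, under the stated model (binary labels, uniform prior, worker i correct independently with probability q_i), the Bayes-optimal prediction is a log-odds-weighted majority vote. A sketch under those assumptions (the paper's exact smoothed objective is in its supplementary material):

```python
import numpy as np

def bayes_predict(z, q):
    """Bayes-optimal label from worker votes z_i in {-1, +1} with accuracies q_i,
    assuming independent workers and a uniform prior over the two classes."""
    z = np.asarray(z, dtype=float)
    q = np.asarray(q, dtype=float)
    score = np.sum(z * np.log(q / (1.0 - q)))  # log posterior odds of +1 vs -1
    return 1 if score >= 0 else -1
```

A highly reliable worker (q near 1) thus carries a large weight, while a worker with q = 0.5 contributes nothing.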
We conduct the crowdsourcing task assignment experiment on synthetic data. Specifically, we generate the quality matrix $Q$ from the uniform distribution over a fixed interval. The prior probabilities of the two classes are set to one half for all the tasks. To avoid evaluating the expectation term, we apply the stochastic iterative hard thresholding framework (Nguyen et al., 2014). In each iteration we sample the true labels $y_j$ from the prior and the workers' answers $z_{ij}$ based on $Q$, and then compute a stochastic gradient based on the sampled labels.
Besides the proposed formulation (8), we evaluate the random assignment algorithm and the Q-based linear programming (Ho et al., 2013). The random assignment algorithm, widely used in practice, is the most straightforward approach: given the total assignment budget and the restrictions on workers and tasks, randomly assign tasks to the workers. The Q-based linear programming uses a linear combination of the worker qualities to evaluate the overall accuracy on each task, for a simpler formulation. In addition, it does not consider the restriction on tasks, thus it may assign many workers to a difficult task^{2}.

^{2} A "difficult" task means that all workers' qualities are low on this task.

To make a fair comparison, the task restriction is added to this method. To obtain assignment results which satisfy the task and worker restrictions, we use our projection operator in the other methods too.
We evaluate the experiments with different values of the worker and task sparsity by setting them as different ratios of the total number of tasks and workers. The overall sparsity is set in the same way as in Section 5.1. To measure the performance, we compare the sampled expected accuracy, where the samples are independent of the samples used in training. Figure 4 shows the comparison of the expected accuracy of the three approaches. We can observe that the accuracy increases with a larger ratio (i.e., more assignments). The random assignment strategy needs more assignments to reach the same accuracy as the other two methods.
5.3 Application in Identification of Gene Regulatory Networks
In this section, we apply the projection operator to the identification of gene regulatory networks (GRN).
Background.
A gene regulatory network represents the relations between different genes, and it plays important roles in biological processes and activities by controlling the expression levels of RNAs. There is a well-known biological competition named the DREAM challenge about identifying GRNs. Based on time-series gene expression data, which are the RNA levels along a time sequence, contestants are required to recover the whole gene network of a given size. One popular way to infer a GRN is to utilize its sparsity: e.g., one gene in the network is only related to a small number of genes, and we already know that there exists no relationship between some genes. Therefore, the number of edges connecting to one vertex is far less than the dimension of the graph. This is a practical case of row-wise and column-wise sparsity for a matrix, and we can apply the projection operator to constrain the number of edges related to each vertex in order to identify the whole network. Recently, the dynamic Bayesian network (DBN) (Zou and Conzen, 2005) has been considered an effective model to recover GRNs. The RNA levels of all genes in the GRN at time $t$ are stored in the gene expression vector $x_t \in \mathbb{R}^n$, where each entry corresponds to one gene and $n$ is the number of genes in the GRN. We define the total number of time points in the experiment as $T$. The gene activity model is usually assumed to be $x_{t+1} = Px_t + w_t$, where $P$ is the covariance matrix of the GRN and $w_t$ is Gaussian white noise. Then the difference of the RNA levels between time points $t+1$ and $t$, i.e., $\Delta x_t = x_{t+1} - x_t$, satisfies $\Delta x_t = Wx_t + w_t$, where $W$ is the true sparse $n$ by $n$ matrix. Therefore, the GRN is only considered between different genes and we eliminate edges whose start and end vertices are the same. We define $Y = [\Delta x_1, \dots, \Delta x_{T-1}]$ and $Z = [x_1, \dots, x_{T-1}]$. The objective function is $f(W) = \|Y - WZ\|_F^2$.
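Under a linear-dynamics model of this kind, the loss and its gradient take a standard least-squares form. A sketch with our own symbols (W is the weight matrix, Z stacks expression vectors as columns, Y stacks the successive differences):

```python
import numpy as np

def grn_loss_grad(W, Z, Y):
    """f(W) = ||Y - W Z||_F^2 and its gradient, where column t of Z is the
    expression vector x_t and column t of Y is the difference x_{t+1} - x_t."""
    R = Y - W @ Z               # residual matrix
    loss = float(np.sum(R ** 2))
    grad = -2.0 * R @ Z.T       # d/dW of ||Y - W Z||_F^2
    return loss, grad
```

Plugging this gradient into IHT with the row/column TVCS projection enforces the in-degree and out-degree limits at every iteration.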
Time-course Gene Expression Data.
To evaluate our method, we employ GeneNetWeaver (Marbach et al., 2009; Schaffter et al., 2011), the official DREAM Challenge tool for time-series expression data generation. With typical gene network structures and ordinary differential equation (ODE) models, GeneNetWeaver produces the time-course gene expression data at prespecified time points. In the simulation studies, we fix the size of the gene network, and the gene expression data are generated under 10% Gaussian white noise. The network is shown in Figure 5, where it is clear that one gene only has a few connections to other genes. Therefore, the GRN is sparse and we are able to restrict the in-degree and out-degree of every vertex by representing the network as a matrix and controlling the sparsity within each row and column.
Performance evaluation.
Six commonly-used criteria are considered to measure the performance, i.e., sensitivity (SN), specificity (SP), accuracy (ACC), F-measure, Matthews correlation coefficient (MCC), and the Area Under the ROC Curve (AUC):

$\mathrm{SN} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \quad \mathrm{SP} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}, \quad \mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}},$

$\text{F-measure} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \quad \mathrm{MCC} = \frac{\mathrm{TP} \cdot \mathrm{TN} - \mathrm{FP} \cdot \mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}},$

where TP and TN denote the true positives and true negatives, and FP and FN denote the false positives and false negatives, respectively. With these criteria, we compare the performance of our method with six representative algorithms: PCC, ARACNE (Margolin et al., 2006), CLR (Faith et al., 2007), MINET (Meyer et al., 2008), GENIE3 (Huynh-Thu et al., 2010), and TIGRESS (Haury et al., 2012). The results are summarized in Table 1. Our method outperforms the other six state-of-the-art methods: the AUC of our method reaches above 0.71, which is far higher than the other methods, and 5 out of 6 criteria show that our method has a significant advantage over the other algorithms.
Table 1: Performance comparison for GRN identification (mean ± standard deviation).

Method | SN | SP | ACC | F-measure | MCC | AUC
Our Method | 0.6875±0.0295 | 0.7397±0.0319 | 0.7119±0.0305 | 0.7126±0.0306 | 0.4264±0.0611 | 0.7136±0.0306
GENIE3 | 0.5611±0.0277 | 0.4984±0.0547 | 0.5319±0.0244 | 0.5279±0.0277 | 0.0595±0.0547 | 0.5662±0.0244
CLR | 0.5167±0.0583 | 0.4476±0.1147 | 0.4844±0.0575 | 0.4795±0.0583 | 0.0357±0.1147 | 0.5210±0.0575
TIGRESS | 0.1333±0.0541 | 0.8302±0.0367 | 0.4585±0.0374 | 0.2258±0.0817 | 0.0552±0.1061 | 0.5567±0.0358
PCC | 0.5042±0.0124 | 0.4333±0.0245 | 0.4711±0.0101 | 0.4661±0.0124 | 0.0625±0.0245 | 0.5091±0.0101
ARACNE | 0.1167±0.0519 | 0.9127±0.0579 | 0.4881±0.0197 | 0.2051±0.0519 | 0.0479±0.0579 | 0.5808±0.0197
MINET | 0.5764±0.0425 | 0.5381±0.0888 | 0.5585±0.0458 | 0.5547±0.0425 | 0.1147±0.0888 | 0.5910±0.0458
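The confusion-matrix criteria reported in Table 1 can be computed as follows (a sketch; AUC requires ranking scores rather than counts and is omitted):

```python
import math

def grn_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix criteria (SN, SP, ACC, F-measure, MCC)."""
    sn = tp / (tp + fn)                       # sensitivity
    sp = tn / (tn + fp)                       # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)     # accuracy
    f = 2 * tp / (2 * tp + fp + fn)           # F-measure
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "F": f, "MCC": mcc}
```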
6 Conclusion
This paper considers TVCS constrained optimization, motivated by intrinsic restrictions in many important applications, for example, in bioinformatics, recommendation systems, and crowdsourcing. To solve the cardinality constrained problem, the key step is the projection onto the cardinality constraints. Although the projection onto overlapped cardinality constraints is NP-hard in general, we prove that if the TVCS condition is satisfied, the projection can be reduced to a linear program. We further prove that there is an iterative algorithm which finds an integer solution to the linear program within time complexity $O\big((p + |\mathcal{G}|) \log(R) / \log(1/\tau)\big)$, where $R$ is the distance from the initial point to the optimal solution and $\tau$ is the convergence rate. We finally use synthetic experiments and two interesting applications in bioinformatics and crowdsourcing to validate the proposed TVCS model.
Acknowledgements
This project is supported in part by the NSF grant CNS-1548078 and the NEC fellowship.
References
 Bach et al. [2012] F. Bach, R. Jenatton, J. Mairal, G. Obozinski, et al. Structured sparsity through convex optimization. Statistical Science, 27(4):450–468, 2012.
 Baraniuk et al. [2010] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. Information Theory, IEEE Transactions on, 56(4):1982–2001, 2010.
 Candes et al. [2006] E. J. Candes, J. K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on pure and applied mathematics, 59(8):1207–1223, 2006.
 Cevher et al. [2009] V. Cevher, P. Indyk, C. Hegde, and R. G. Baraniuk. Recovery of clustered sparse signals from compressive measurements. Technical report, DTIC Document, 2009.

 El Halabi and Cevher [2015] M. El Halabi and V. Cevher. A totally unimodular view of structured sparsity. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 223–231, 2015.
 Faith et al. [2007] J. J. Faith, B. Hayete, J. T. Thaden, I. Mogno, J. Wierzbowski, G. Cottarel, S. Kasif, J. J. Collins, and T. S. Gardner. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS biol, 5(1):e8, 2007.
 Haury et al. [2012] A.C. Haury, F. Mordelet, P. VeraLicona, and J.P. Vert. Tigress: trustful inference of gene regulation using stability selection. BMC systems biology, 6(1):145, 2012.
 Hegde et al. [2009] C. Hegde, M. F. Duarte, and V. Cevher. Compressive sensing recovery of spike trains using a structured sparsity model. In SPARS'09 - Signal Processing with Adaptive Sparse Structured Representations, 2009.
 Hegde et al. [2015a] C. Hegde, P. Indyk, and L. Schmidt. Approximation algorithms for model-based compressive sensing. Information Theory, IEEE Transactions on, 61(9):5129–5147, 2015a.

 Hegde et al. [2015b] C. Hegde, P. Indyk, and L. Schmidt. A nearly-linear time framework for graph-structured sparsity. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 928–937, 2015b.
 Ho et al. [2013] C.-J. Ho, S. Jabbari, and J. W. Vaughan. Adaptive task assignment for crowdsourced classification. In Proceedings of The 30th International Conference on Machine Learning, pages 534–542, 2013.
 Hoffman [2003] A. J. Hoffman. On approximate solutions of systems of linear inequalities. In Selected Papers Of Alan J Hoffman: With Commentary, pages 174–176. 2003.
 Huynh-Thu et al. [2010] V. A. Huynh-Thu, A. Irrthum, L. Wehenkel, and P. Geurts. Inferring regulatory networks from expression data using tree-based methods. PloS one, 5(9):e12776, 2010.
 Jenatton et al. [2011] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12(Jul):2297–2334, 2011.
 Kong et al. [2014] D. Kong, R. Fujimaki, J. Liu, F. Nie, and C. Ding. Exclusive feature learning on arbitrary structures via $\ell_{1,2}$-norm. In Advances in Neural Information Processing Systems, pages 1655–1663, 2014.
 Liu and Wright [2015] J. Liu and S. J. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351–376, 2015.
 Marbach et al. [2009] D. Marbach, T. Schaffter, C. Mattiussi, and D. Floreano. Generating realistic in silico gene networks for performance assessment of reverse engineering methods. Journal of computational biology, 16(2):229–239, 2009.
 Margolin et al. [2006] A. A. Margolin, I. Nemenman, K. Basso, C. Wiggins, G. Stolovitzky, R. D. Favera, and A. Califano. Aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC bioinformatics, 7(Suppl 1):S7, 2006.
 Meyer et al. [2008] P. E. Meyer, F. Lafitte, and G. Bontempi. minet: A R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC bioinformatics, 9(1):461, 2008.
 Needell and Tropp [2009] D. Needell and J. A. Tropp. Cosamp: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.
 Nguyen et al. [2012] N. Nguyen, S. Chin, and T. D. Tran. A unified iterative greedy algorithm for sparsity constrained optimization. 2012.
 Nguyen et al. [2014] N. Nguyen, D. Needell, and T. Woolf. Linear convergence of stochastic iterative greedy algorithms with sparse constraints. arXiv preprint arXiv:1407.0088, 2014.
 Olshausen and Field [1997] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37(23):3311–3325, 1997.
 Papadimitriou and Steiglitz [1982] C. H. Papadimitriou and K. Steiglitz. Combinatorial optimization: algorithms and complexity. Courier Corporation, 1982.
 Schaffter et al. [2011] T. Schaffter, D. Marbach, and D. Floreano. Genenetweaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics, 27(16):2263–2270, 2011.
 Yang et al. [2016] H. Yang, Y. Huang, L. Tran, J. Liu, and S. Huang. On benefits of selection diversity via bilevel exclusive sparsity. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. IEEE, 2016.
 Yuan and Lin [2006] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
 Yuan et al. [2014] X. Yuan, P. Li, and T. Zhang. Gradient hard thresholding pursuit for sparsityconstrained optimization. In Proceedings of The 31st International Conference on Machine Learning, pages 127–135, 2014.
 Yuan et al. [2016] X. Yuan, P. Li, and T. Zhang. Exact recovery of hard thresholding pursuit. In Advances in Neural Information Processing Systems, pages 3558–3566, 2016.
 Zhang [2009] T. Zhang. On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research, 10(Mar):555–568, 2009.
 Zhou et al. [2010] Y. Zhou, R. Jin, and S. C. Hoi. Exclusive lasso for multitask feature selection. In AISTATS, volume 9, pages 988–995, 2010.
 Zou and Conzen [2005] M. Zou and S. D. Conzen. A new dynamic bayesian network (dbn) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics, 21(1):71–79, 2005.
1 Proof of Lemma 1
Firstly we show how to convert the projection problem (2) into a support set selection problem. For any vector $v \in \mathbb{R}^p$, let the binary vector $z \in \{0,1\}^p$ indicate the nonzero positions of $v$; then we can claim that
\[
P_{\Omega}(v) = v \odot z^*, \qquad z^* \in \arg\max_{z}\Big\{ \textstyle\sum_{i} v_i^2 z_i \;:\; z \text{ is a feasible support indicator} \Big\},
\]
where $v \odot z$ is a vector of the same dimension as $v$ that keeps the elements of $v$ at the positions where $z$ has a “1” and fills in zeros at the positions where $z$ has a “0”. In addition, a vector lies in the feasible set $\Omega$ if and only if its support set indicator vector $z$ satisfies the constraints defined in (4).
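The conversion above can be illustrated with a toy brute-force sketch. This is only an exhaustive illustration under assumed small sizes (the group layout, budgets, and function name are illustrative, not the paper's algorithm): we enumerate every support indicator $z$, discard infeasible ones, and keep the entries of $v$ on the support that retains the most squared mass.

```python
import itertools
import numpy as np

def project_by_support_selection(v, groups, budgets):
    """Brute-force projection onto group cardinality constraints.

    Enumerates all support indicators z in {0,1}^p, keeps the feasible
    ones, and selects the z retaining the largest squared mass of v.
    Exponential in p -- tiny examples only.
    """
    p = len(v)
    best_z, best_val = None, -1.0
    for z in itertools.product([0, 1], repeat=p):
        z = np.array(z)
        if any(z[list(g)].sum() > s for g, s in zip(groups, budgets)):
            continue  # violates a group cardinality bound
        val = float(np.sum((v ** 2) * z))  # energy kept by this support
        if val > best_val:
            best_val, best_z = val, z
    return v * best_z  # keep entries where z has "1", zero elsewhere

v = np.array([3.0, -1.0, 2.0, 0.5])
# Overall sparsity <= 2, and at most one nonzero in each half.
groups = [(0, 1, 2, 3), (0, 1), (2, 3)]
budgets = [2, 1, 1]
x = project_by_support_selection(v, groups, budgets)  # -> [3., 0., 2., 0.]
```

Because feasibility depends only on the support, minimizing the squared distance to $v$ is the same as maximizing the retained energy, which is exactly the selection rule implemented above.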
2 Proof of Theorem 2
To prove Theorem 2, we use the concept of a totally unimodular matrix.
Definition 2.
(Totally Unimodular (TU) Matrix) An integer matrix $A$ is TU if the determinant of every square submatrix (a submatrix here is a smaller square matrix obtained by removing certain rows and columns) is in the set $\{0, +1, -1\}$.
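Definition 2 can be checked directly on small matrices. The sketch below (a brute-force illustration only; it is exponential in the matrix size, and the function name is ours) enumerates every square submatrix and tests whether each determinant lies in $\{0, +1, -1\}$.

```python
import itertools
import numpy as np

def is_totally_unimodular(A):
    """Check TU by enumerating every square submatrix (tiny matrices only)."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    for k in range(1, min(m, n) + 1):
        for rows in itertools.combinations(range(m), k):
            for cols in itertools.combinations(range(n), k):
                d = int(round(np.linalg.det(A[np.ix_(rows, cols)])))
                if d not in (-1, 0, 1):
                    return False  # found a submatrix with |det| >= 2
    return True

# An interval matrix (consecutive ones in each row) is a classic TU example.
interval = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]
# The vertex-edge incidence matrix of a triangle (odd cycle) is not TU:
# its full determinant is 2.
triangle = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
```

This brute-force check is what the proofs below establish analytically for the TVCS constraint matrices.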
Proposition 1.
If $A$ is TU, then $A^\top$ is TU, and their concatenations with identity matrices (i.e., $[A \;\; I]$ and $[A^\top \;\; I]$) are still TU.
Proof.
Since transposing a matrix does not change its determinant, it is obvious that $A^\top$ is TU.
Then we prove that stacking $A$ with an identity matrix preserves the TU property. We prove it by induction. Firstly, every submatrix of size 1 has determinant in $\{0, +1, -1\}$, because every element of $[A \;\; I]$ is 0, $+1$, or $-1$. Now suppose every submatrix of size $k$ has determinant in $\{0, +1, -1\}$, and consider a submatrix of size $k+1$. To show that its determinant also lies in this set, we only need to prove that adding a new row/column from the identity part does not move the determinant out of the set. Since any row/column from $I$ has only one nonzero element “1”, we can eliminate the other elements in the corresponding column/row by subtracting a multiple of this row/column from the other rows/columns. After that, we can remove this row and column, which can only change the sign of the determinant. So a submatrix of size $k+1$ has determinant in $\{0, +1, -1\}$ whenever all submatrices of size $k$ do. ∎

Lemma 4.
If $A$ is TU and $b$ is an integer vector, then all vertices of the following polytope are integer points:
\[
\{ x : Ax \le b,\; x \ge 0 \}. \qquad (10)
\]
Proof.
This is a classical consequence of total unimodularity; see, e.g., Papadimitriou and Steiglitz (1982). ∎
Lemma 5.
If $A$ is the matrix each of whose rows is the indicator vector of a group in our TVCS model, then $A$ is a TU matrix.
Proof.
Since the rows of $A$ are the indicator vectors of the groups in $\mathcal{G}_1 \cup \mathcal{G}_2$, every entry of $A$ is 0 or 1. From Definition 1, we know that there are at most two “1”s in each column. For a column that has two “1”s, the two corresponding groups are from $\mathcal{G}_1$ and $\mathcal{G}_2$ respectively (because groups within $\mathcal{G}_1$, and groups within $\mathcal{G}_2$, do not overlap).
In this way, our matrix $A$ meets the conditions of Theorem 13.3 (see Papadimitriou and Steiglitz, 1982, chap. 13), and it is a TU matrix. ∎
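As a sanity check, the sketch below (an illustration under assumed sizes; the grid layout and names are ours) builds the 0/1 matrix whose rows indicate the row-groups and column-groups of a 2×2 variable grid — two non-overlapping families, exactly the situation of the lemma — and verifies total unimodularity by brute force.

```python
import itertools
import numpy as np

def is_tu(A):
    """Brute-force total unimodularity check (tiny matrices only)."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    for r_idx in itertools.combinations(range(m), 1):
        pass  # placeholder removed below; full loop follows
    for k in range(1, min(m, n) + 1):
        for r in itertools.combinations(range(m), k):
            for c in itertools.combinations(range(n), k):
                if int(round(np.linalg.det(A[np.ix_(r, c)]))) not in (-1, 0, 1):
                    return False
    return True

# 2x2 grid of variables, indexed 0..3. Row groups and column groups each
# partition the variables, so overlaps only occur between the two families.
p = 4
groups = [{0, 1}, {2, 3},   # G1: rows of the grid
          {0, 2}, {1, 3}]   # G2: columns of the grid
A = np.array([[1 if j in g else 0 for j in range(p)] for g in groups])
```

Every column of this matrix has exactly two “1”s, one from each family, matching the condition used in the proof.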
Lemma 6.
Let $\bar{A}$ denote the matrix obtained by stacking the all-one row $\mathbf{1}^\top$ (the indicator vector of the overall group in $\mathcal{G}_0$) on top of $A$. Then $\bar{A}$ is a TU matrix.
Proof.
From Lemma 5, we know that $A$ is a TU matrix for any $\mathcal{G}_1$ and $\mathcal{G}_2$ of our TVCS model. In other words, any square submatrix restricted to the rows of $A$ has determinant $0$, $+1$, or $-1$. Therefore, we only need to consider square submatrices $S$ of $\bar{A}$ that overlap with the first row $\mathbf{1}^\top$. There are only three possible forms of such a submatrix $S$. We will show that all of their determinants are in $\{0, +1, -1\}$.
Case 1.
At least one column of $S$ has a single “1”; since the first row $\mathbf{1}^\top$ contributes a “1” to every column, this single “1” must appear in the first row. By exchanging such a column with the last column (which can only influence the sign of the determinant), we can transform $S$ into the form
\[
\begin{pmatrix} \mathbf{1}^\top & 1 \\ B & \mathbf{0} \end{pmatrix},
\]
where $B$ is a submatrix of $A$. From the determinant expansion along the last column, we have $\det(S) = \pm \det(B)$. Therefore, submatrices of this form have determinants in $\{0, +1, -1\}$.
Case 2.
All columns of $S$ have three “1” elements (the row coming from $\mathbf{1}^\top$ contributes a “1” in every column, and the remaining two “1”s come from a row of $\mathcal{G}_1$ and a row of $\mathcal{G}_2$ respectively). For the rows which are from $\mathcal{G}_1$, we can sum all of them into one of these rows (this does not change the determinant). In this way we transform $S$ into a form that contains a second all-one row, i.e., a duplicate of the first row. In this case, $S$ is not full rank, so its determinant is 0.
Case 3.
Each column in $S$ contains at least two “1” elements, and there exists one column which has exactly two “1”s. By exchanging it with the last column, we can transform $S$ so that its last column has exactly two “1”s.
This means that one “1” is in the first row, and the other is in a row coming from $A$; let us say it is the $j$-th row. Since subtracting one row from another row does not change the determinant, we can subtract the $j$-th row from the first row.
Now the last column only has a single “1”, in the $j$-th row. We can generate a smaller matrix $S'$ by removing the $j$-th row and the last column, and if $S'$ has determinant in $\{0, +1, -1\}$, so does $S$.
If the $j$-th row contains other “1”s besides the one in the last column, then after the subtraction some positions in the first row (namely, the columns where the $j$-th row has a “1”) become zeros. For any column of the matrix $S'$ which has a “0” element in the first row, there are two cases:
(a) This column contains only zeros; then $S'$ has zero determinant.

(b) This column contains a single “1”; then we can generate a smaller matrix $S''$ by removing this column and the row where this “1” sits. If $S''$ has determinant in $\{0, +1, -1\}$, so does $S'$.
Notice that it is impossible for such a column to have two “1”s: each column can have at most three “1”s in total, we have already removed the “1” in the first row by the subtraction, and we discarded another “1” by removing the $j$-th row. In the above case (b), we can repeat removing columns and rows until we either get a degenerate matrix (with 0 determinant) or a matrix whose first row does not contain zeros. In the latter situation, we can process it by the same procedure as the original matrix $S$, unless it has only one row and one column, i.e., it is a matrix with the single element “1” (with determinant 1).
If the $j$-th row has no other “1”s besides the one in the last column, the first row of $S'$ remains all ones, and we can also process $S'$ by the same procedure as the original matrix $S$.
Therefore, we have proved that any square submatrix of $\bar{A}$ has determinant in $\{0, +1, -1\}$, which means $\bar{A}$ is TU; by Proposition 1, its concatenation with the identity matrix is also TU. ∎
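A practical consequence of Theorem 2 is that the linear programming relaxation of the support selection problem has integral optimal vertices. The sketch below (an illustration under assumed small sizes, using `scipy.optimize.linprog`; not the paper's solver, and the budgets are ours) maximizes the retained energy over the relaxed box $0 \le z \le 1$ subject to TVCS group bounds and observes that the optimum comes out 0/1.

```python
import numpy as np
from scipy.optimize import linprog

v = np.array([3.0, -1.0, 2.0, 0.5])
p = len(v)
# TVCS groups on a 2x2 grid: overall budget, then row and column budgets.
groups  = [{0, 1, 2, 3}, {0, 1}, {2, 3}, {0, 2}, {1, 3}]
budgets = [2, 1, 1, 1, 1]
A_ub = np.array([[1 if j in g else 0 for j in range(p)] for g in groups])

# Maximize sum(v_i^2 * z_i)  <=>  minimize -(v**2) @ z over the relaxation.
res = linprog(-(v ** 2), A_ub=A_ub, b_ub=budgets,
              bounds=[(0, 1)] * p, method="highs")
z = res.x          # integral because the constraint matrix is TU
x = v * np.round(z)  # resulting projection candidate
```

Because the constraint matrix is TU and the budgets are integers, the LP solver lands on a 0/1 vertex, so the combinatorial support selection is solved by a continuous LP.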
3 Proof of Theorem 3
To prove Theorem 3, we start with several lemmas.
Lemma 7.
Proof.
Since , there exists at least an such that
From Hoffman’s Theorem (Hoffman, 2003), we know that there exists a , such that
Therefore, we know for any in ,
and
∎
Using the lemma above, we can now prove Theorem 3.
Proof.
Denote by . We have
Let . Then we have
where is the Lipschitz constant of the gradient. Returning to the original inequality, we have
where the last inequality comes from Lemma 7.
Let and we have
which shows the linear convergence rate and completes the proof. ∎
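The final step uses only the standard fact that a one-step contraction implies a geometric (linear) rate. In notation-free form (with $e_t$ denoting the optimality gap at iteration $t$ and $\rho \in (0,1)$ the contraction factor, both symbols ours):

```latex
% If each iteration contracts the gap by a fixed factor rho < 1,
% then unrolling the recursion over t iterations gives a geometric decay:
\[
  e_{t+1} \le \rho\, e_t
  \;\Longrightarrow\;
  e_t \le \rho^{t}\, e_0 , \qquad \rho \in (0,1),
\]
% which is precisely what is meant by a linear convergence rate.
```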
4 Formulation of the Expected Accuracy in Crowdsourcing Task Assignment
In the crowdsourcing task assignment problem, recall the objective function of problem (8):
For the $i$-th task, it is defined as follows:
(12) 
where $\mathbb{1}\{\cdot\}$ is the indicator function. We can further specify this formulation by considering the equivalent forms of the two terms:
A similar derivation can be applied to the other term (changing “$\ge$” to “$<$”). Here we substitute the indicator function with a smooth surrogate to obtain a smooth approximation. With shorthand notation for the resulting quantities, the (smooth) objective turns out as follows, together with its stochastic gradient:
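The smoothing step described above can be sketched generically as follows. This is an illustration of replacing a hard indicator $\mathbb{1}\{x \ge 0\}$ with a sigmoid surrogate; the temperature parameter `tau` and function names are assumptions, not necessarily the paper's exact choice.

```python
import numpy as np

def indicator(x):
    """Hard indicator 1{x >= 0} -- non-differentiable, gradient 0 a.e."""
    return (np.asarray(x) >= 0).astype(float)

def smooth_indicator(x, tau=0.1):
    """Sigmoid surrogate 1 / (1 + exp(-x / tau)).

    Differentiable everywhere and approaches the hard indicator as
    tau -> 0, which makes the objective amenable to (stochastic)
    gradient methods.
    """
    return 1.0 / (1.0 + np.exp(-np.asarray(x) / tau))

xs = np.array([-2.0, -0.5, 0.5, 2.0])
hard = indicator(xs)                    # -> [0., 0., 1., 1.]
soft = smooth_indicator(xs, tau=0.1)    # close to hard away from 0
```

Smaller `tau` tightens the approximation at the price of steeper (harder to optimize) gradients near zero, which is the usual trade-off with this kind of surrogate.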