A variety of applications share the core task of selecting a subset of columns from a short, wide matrix with rows and columns. The criteria for selecting these columns typically aim at preserving information about the span of while generating a well-conditioned submatrix. Classical and recent examples include experimental design, where we select observations or experiments ; preconditioning for solving linear systems and constructing low-stretch spanning trees (here is a version of the node-edge incidence matrix and we select edges in a graph) [6, 4]; matrix approximation [11, 13, 24]
; feature selection in-means clustering [10, 12]; sensor selection  and graph signal processing [14, 41].
In this work, we study a randomized approach that holds promise for all of these applications. This approach relies on sampling columns of
according to a probability distribution defined over its submatrices: the probability of selecting a setof columns from , with , is
where is the submatrix consisting of the selected columns. This distribution is reminiscent of volume sampling, where columns are selected with probability proportional to the determinant of a matrix, i.e., the squared volume of the parallelepiped spanned by the selected columns. (Volume sampling does not apply to as the involved determinants vanish.) In contrast, uses the determinant of an matrix and uses the volume spanned by the rows formed by the selected columns. Hence we refer to -sampling as dual volume sampling (DVS).
Despite the ostensible similarity between volume sampling and DVS, and despite the many practical implications of DVS outlined below, efficient algorithms for DVS are not known and were raised as open questions in . In this work, we make two key contributions:
We develop polynomial-time randomized sampling algorithms and their derandomization for DVS. Surprisingly, our proofs require only elementary (but involved) matrix manipulations.
We establish that is a Strongly Rayleigh measure , a remarkable property that captures a specific form of negative dependence. Our proof relies on the theory of real stable polynomials, and the ensuing result implies a provably fast-mixing, practical MCMC sampler. Moreover, this result implies concentration properties for dual volume sampling.
In parallel with our work,  also proposed a polynomial time sampling algorithm that works efficiently in practice. Our work goes on to further uncover the hitherto unknown “Strong Rayleigh” property of DVS, which has important consequences, including those noted above.
1.1 Connections and implications.
The selection of columns from a short and wide matrix has many applications. Our algorithms for DVS hence have several implications and connections; we note a few below.
Experimental design. The theory of optimal experiment design explores several criteria for selecting the set of columns (experiments) . Popular choices are
Here, denotes the Moore-Penrose pseudoinverse of , and the minimization ranges over all such that has full row rank
. A-optimal design, for instance, is statistically optimal for linear regression.
Finding an optimal solution for these design problems is NP-hard; and most discrete algorithms use local search . Avron and Boutsidis [6, Theorem 3.1] show that dual volume sampling yields an approximation guarantee for both A- and E-optimal design: if is sampled from , then
Avron and Boutsidis  provide a polynomial time sampling algorithm only for the case . Our algorithms achieve the bound (1.3) in expectation, and the derandomization in Section 2.3 achieves the bound deterministically. Wang et al.  recently (in parallel) achieved approximation bounds for A-optimality via a different algorithm combining convex relaxation and a greedy method. Other methods include leverage score sampling  and predictive length sampling .
Low-stretch spanning trees and applications. Objectives 1.1 also arise in the construction of low-stretch spanning trees, which have important applications in graph sparsification, preconditioning and solving symmetric diagonally dominant (SDD) linear systems , among others . In the node-edge incidence matrix of an undirected graph with nodes and edges, the column corresponding to edge is . Let be the SVD of with . The stretch of a spanning tree in is then given by . In those applications, we hence search for a set of edges with low stretch.
Network controllability. The problem of sampling columns in a matrix also arises in network controllability. For example, Zhao et al.  consider selecting control nodes
(under certain constraints) over time in complex networks to control a linear time-invariant network. After transforming the problem into a column subset selection problem from a short and wide controllability matrix, the objective becomes essentially an E-optimal design problem, for which the authors use greedy heuristics.
From a matrix with columns, we sample a set of columns (), where
. We denote the singular values ofby , in decreasing order. We will assume has full row rank , so . We also assume that for every where . By , we denote the -th elementary symmetric polynomial of , i.e., the -th coefficient of the characteristic polynomial .
2 Polynomial-time Dual Volume Sampling
We describe in this section our method to sample from the distribution . Our first method relies on the key insight that, as we show, the marginal probabilities for DVS can be computed in polynomial time. To demonstrate this, we begin with the partition function and then derive marginals.
The partition function has a conveniently simple closed form, which follows from the Cauchy-Binet formula and was also derived in .
Lemma 1 (Partition Function ).
For with and , we have
Next, we will need the marginal probability that a given set is a subset of the random set . In the following theorem, the set denotes the (set) complement of , and denotes the orthogonal complement of .
Theorem 2 (Marginals).
We prove Theorem 2 via a perturbation argument that connects DVS to volume sampling. Specifically, observe that for and it holds that
Carefully letting bridges volumes with “dual” volumes. The technical remainder of the proof further relates this equality to singular values, and exploits properties of characteristic polynomials. A similar argument yields an alternative proof of Lemma 1. We show the proofs in detail in Appendix A and B respectively.
The numerator of in Theorem 2 requires time to compute the first term, to compute the second and to compute the third. The denominator takes time, amounting in a total time of to compute the marginal probability.
The marginal probabilities derived above directly yield a polynomial-time exact DVS algorithm. Instead of -sets, we sample ordered -tuples . We denote the -tuple variant of the DVS distribution by :
Sampling is now straightforward. At the th step we sample via ; these probabilities are easily obtained from the marginals in Theorem 2.
Let , and as in Theorem 2. Then,
As a result, it is possible to draw an exact dual volume sample in time .
The full proof may be found in the appendix. The running time claim follows since the sampling algorithm invokes computations of marginal probabilities, each costing time.
A potentially more efficient approximate algorithm could be derived by noting the relations between volume sampling and DVS. Specifically, we add a small perturbation to DVS as in Equation 2.1 to transform it into a volume sampling problem, and apply random projection for more efficient volume sampling as in . Please refer to Appendix C for more details.
Next, we derandomize the above sampling algorithm to deterministically select a subset that satisfies the bound (1.3) for the Frobenius norm, thereby answering another question in . The key insight for derandomization is that conditional expectations can be computed in polynomial time, given the marginals in Theorem 2:
Let be such that the marginal distribution satisfies . The conditional expectation can be expressed as
where are the unnormalized marginal distributions, and it can be computed in time.
We show the full derivation in Appendix D.
Corollary 4 enables a greedy derandomization procedure. Starting with the empty tuple , in the th iteration, we greedily select and append it to our selection: . The final set is the non-ordered version of . Theorem 5 shows that this greedy procedure succeeds, and implies a deterministic version of the bound (1.3).
The greedy derandomization selects a column set satisfying
In the proof, we construct a greedy algorithm. In each iteration, the algorithm computes, for each column that has not yet been selected, the expectation conditioned on this column being included in the current set. Then it chooses the element with the lowest conditional expectation to actually be added to the current set. This greedy inclusion of elements will only decrease the conditional expectation, thus retaining the bound in Theorem 5. The detailed proof is deferred to Appendix E.
Complexity. Each iteration of the greedy selection requires to compute conditional expectations. Thus, the total running time for iterations is . The approximation bound for the spectral norm is slightly worse than that in (1.3), but is of the same order if .
3 Strong Rayleigh Property and Fast Markov Chain Sampling
Next, we investigate DVS more deeply and discover that it possesses a remarkable structural property, namely, the Strongly Rayleigh (SR)  property. This property has proved remarkably fruitful in a variety of recent contexts, including recent progress in approximation algorithms , fast sampling [2, 27], graph sparsification [22, 39], extensions to the Kadison-Singer problem , and certain concentration of measure results , among others.
For DVS, the SR property has two major consequences: it leads to a fast mixing practical MCMC sampler, and it implies results on concentration of measure.
Strongly Rayleigh measures.
SR measures were introduced in the landmark paper of Borcea et al. , who develop a rich theory of negatively associated measures. In particular, we say that a probability measure is negatively associated if for increasing functions on with disjoint support. This property reflects a “repelling” nature of , a property that occurs more broadly across probability, combinatorics, physics, and other fields—see [36, 8, 42] and references therein. The negative association property turns out to be quite subtle in general; the class of SR measures captures a strong notion of negative association and provides a framework for analyzing such measures.
Specifically, SR measures are defined via their connection to real stable polynomials [36, 8, 42]. A multivariate polynomial where is called real stable if all its coefficients are real and whenever for . A measure is called an SR measure if its multivariate generating polynomial is real stable. Notable examples of SR measures are Determinantal Point Processes [31, 29, 9, 26], balanced matroids [19, 37], Bernoullis conditioned on their sum, among others. It is known (see [8, pg. 523]) that the class of SR measures is exponentially larger than the class of determinantal measures.
3.1 Strong Rayleigh Property of DVS
Theorem 6 establishes the SR property for DVS and is the main result of this section. Here and in the following, we use the notation .
Let and . Then the multiaffine polynomial
is real stable. Consequently, is an SR measure.
The proof of Theorem 6 relies on key properties of real stable polynomials and SR measures established in . Essentially, the proof demonstrates that the generating polynomial of can be obtained by applying a few carefully chosen stability preserving operations to a polynomial that we know to be real stable. Stability, although easily destroyed, is closed under several operations noted in the important proposition below.
Proposition 7 (Prop. 2.1 ).
Let be a stable polynomial. The following properties preserve stability: (i) Substitution: for ; (ii) Differentiation: for any ; (iii) Diagonalization: is stable, and hence ; and (iv) Inversion: .
In addition, we need the following two propositions for proving Theorem 6.
Proposition 8 (Prop. 2.4 ).
Let be Hermitian, and ( be Hermitian semidefinite matrices. Then, the following polynomial is stable:
For and , we have .
Let be a diagonal matrix. Using the Cauchy-Binet identity we have
Thus, when , the (diagonal) indicator matrix for , we obtain . Consequently, in the summation above only terms with survive, yielding
We are now ready to sketch the proof of Theorem 6.
Proof. (Theorem 6)..
Notationally, it is more convenient to prove that the “complement” polynomial is stable; subsequently, an application of Prop. 7-(iv) yields stability of (3.1). Using matrix notation , , our starting stable polynomial (this stability follows from Prop. 8) is
which can be expanded as
Thus, is real stable in variables, indexed below by and where . Instead of the form above, We can sum over but then have to constrain the support to the case when and . In other words, we may write (using Iverson-brackets )
is also stable. Using Prop. 7-(iii), setting retains stability; thus
is also stable. Next, differentiating , times with respect to and evaluating at preserves stability (Prop. 7-(ii) and (i)). In doing so, only terms corresponding to survive, resulting in
which is just (up to a constant); here, the last equality follows from Prop. 9. This establishes stability of and hence of . Since is in addition multiaffine, it is the generating polynomial of an SR measure, completing the proof. ∎
3.2 Implications: MCMC
The SR property of established in Theorem 6 implies a fast mixing Markov chain for sampling . The states for the Markov chain are all sets of cardinality . The chain starts with a randomly-initialized active set , and in each iteration we swap an element with an element with a specific probability determined by the probability of the current and proposed set. The stationary distribution of this chain is the one induced by DVS, by a simple detailed-balance argument. The chain is shown in Algorithm 1.
The convergence of the markov chain is measured via its mixing time: The mixing time of the chain indicates the number of iterations that we must perform (starting from ) before we can consider as an approximately valid sample from . Formally, if is the total variation distance between the distribution of and after steps, then
is the mixing time to sample from a distribution -close to in terms of total variation distance. We say that the chain mixes fast if is polynomial in the problem size.
Theorem 10 (Mixing time).
The mixing time of Markov chain shown in Algorithm 1 is given by
To implement Algorithm 1 we need to compute the transition probabilities . Let and assume . By the matrix determinant lemma we have the acceptance ratio
Thus, the transition probabilities can be computed in time. Moreover, one can further accelerate this algorithm by using the quadrature techniques of  to compute lower and upper bounds on this acceptance ratio to determine early acceptance or rejection of the proposed move.
A remaining question is initialization. Since the mixing time involves , we need to start with such that is sufficiently bounded away from . We show in Appendix F that by a simple greedy algorithm, we are able to initialize such that , and the resulting running time for Algorithm 1 is , which is linear in the size of data set and is efficient when is not too large.
3.3 Further implications and connections
Algorithms for experimental design.
Widely used, classical algorithms for finding an approximate optimal design include Fedorov’s exchange algorithm [20, 21] (a greedy local search) and simulated annealing . Both methods start with a random initial set , and greedily or randomly exchange a column with a column . Apart from very expensive running times, they are known to work well in practice [35, 43]. Yet so far there is no theoretical analysis, or a principled way of determining when to stop the greedy search.
Curiously, our MCMC sampler is essentially a randomized version of Fedorov’s exchange method. The two methods can be connected by a unified, simulated annealing view, where we define with temperature parameter . Driving to zero essentially recovers Fedorov’s method, while our results imply fast mixing for , together with approximation guarantees. Through this lens, simulated annealing may be viewed as initializing Fedorov’s method with the fast-mixing sampler. In practice, we observe that letting improves the approximation results, which opens interesting questions for future work.
We report selection performance of DVS on real regression data (CompAct, CompAct(s), Abalone and Bank32NH111http://www.dcc.fc.up.pt/?ltorgo/Regression/DataSets.html
) for experimental design. We use 4,000 samples from each dataset for estimation. We compare against various baselines, including uniform sampling (Unif), leverage score sampling (Lev) , predictive length sampling (PL) , the sampling (Smpl)/greedy (Greedy) selection methods in  and Fedorov’s exchange algorithm . We initialize the MCMC sampler with Kmeans++  for DVS and run for 10,000 iterations, which empirically yields selections that are sufficiently good. We measure performances via (1) the prediction error ,
and 2) running times. Figure 1 shows the results for these three measures with sample sizes varying from to
. Further experiments (including for the interpolation), may be found in the appendix.
In terms of prediction error, DVS performs well and is comparable with Lev. Its strength compared to the greedy and relaxation methods (Smpl, Greedy, Fedorov) is running time, leading to good time-error tradeoffs. These tradeoffs are illustrated in Figure 1 for .
In other experiments (shown in Appendix G) we observed that in some cases, the optimization and greedy methods (Smpl, Greedy, Fedorov) yield better results than sampling, however with much higher running times. Hence, given time-error tradeoffs, DVS may be an interesting alternative in situations where time is a very limited resource and results are needed quickly.
In this paper, we study the problem of DVS and develop an exact (randomized) polynomial time sampling algorithm as well as its derandomization. We further study dual volume sampling via the theory of real-stable polynomials and prove that its distribution satisfies the “Strong Rayleigh” property. This result has remarkable consequences, especially because it implies a provably fast-mixing Markov chain sampler that makes dual volume sampling much more attractive to practitioners. Finally, we observe connections to classical, computationally more expensive experimental design methods (Fedorov’s method and SA); together with our results here, these could be a first step towards a better theoretical understanding of those methods.
This research was supported by NSF CAREER award 1553284, NSF grant IIS-1409802, DARPA grant N66001-17-1-4039, DARPA FunLoL grant (W911NF-16-1-0551) and a Siebel Scholar Fellowship. The views, opinions, and/or findings contained in this article are those of the author and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense.
- Anari and Gharan  N. Anari and S. O. Gharan. The Kadison-Singer problem for strongly Rayleigh measures and applications to asymmetric TSP. arXiv:1412.1143, 2014.
- Anari and Gharan  N. Anari and S. O. Gharan. Effective-resistance-reducing flows and asymmetric TSP. In IEEE Symposium on Foundations of Computer Science (FOCS), 2015.
- Anari et al.  N. Anari, S. O. Gharan, and A. Rezaei. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In COLT, pages 23–26, 2016.
- Arioli and Duff  M. Arioli and I. S. Duff. Preconditioning of linear least-squares problems by identifying basic variables. SIAM J. Sci. Comput., 2015.
- Arthur and Vassilvitskii  D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 2007.
- Avron and Boutsidis  H. Avron and C. Boutsidis. Faster subset selection for matrices and applications. SIAM Journal on Matrix Analysis and Applications, 34(4):1464–1499, 2013.
- Borcea and Brändén  J. Borcea and P. Brändén. Applications of stable polynomials to mixed determinants: Johnson’s conjectures, unimodality, and symmetrized Fischer products. Duke Mathematical Journal, pages 205–223, 2008.
- Borcea et al.  J. Borcea, P. Brändén, and T. Liggett. Negative dependence and the geometry of polynomials. Journal of the American Mathematical Society, 22:521–567, 2009.
- Borodin  A. Borodin. Determinantal point processes. arXiv:0911.1153, 2009.
- Boutsidis and Magdon-Ismail  C. Boutsidis and M. Magdon-Ismail. Deterministic feature selection for k-means clustering. IEEE Transactions on Information Theory, pages 6099–6110, 2013.
- Boutsidis et al.  C. Boutsidis, M. W. Mahoney, and P. Drineas. An improved approximation algorithm for the column subset selection problem. In SODA, pages 968–977, 2009.
- Boutsidis et al.  C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas. Stochastic dimensionality reduction for k-means clustering. arXiv preprint arXiv:1110.2897, 2011.
- Boutsidis et al.  C. Boutsidis, P. Drineas, and M. Magdon-Ismail. Near-optimal column-based matrix reconstruction. SIAM Journal on Computing, pages 687–717, 2014.
- Chen et al.  S. Chen, R. Varma, A. Sandryhaila, and J. Kovačević. Discrete signal processing on graphs: Sampling theory. IEEE Transactions on Signal Processing, 63(24):6510–6523, 2015.
- Çivril and Magdon-Ismail  A. Çivril and M. Magdon-Ismail. On selecting a maximum volume sub-matrix of a matrix and related problems. Theoretical Computer Science, pages 4801–4811, 2009.
- Derezinski and Warmuth  M. Derezinski and M. K. Warmuth. Unbiased estimates for linear regression via volume sampling. Advances in Neural Information Processing Systems (NIPS), 2017.
- Deshpande and Rademacher  A. Deshpande and L. Rademacher. Efficient volume sampling for row/column subset selection. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 329–338. IEEE, 2010.
- Elkin et al.  M. Elkin, Y. Emek, D. A. Spielman, and S.-H. Teng. Lower-stretch spanning trees. SIAM Journal on Computing, 2008.
Feder and Mihail 
T. Feder and M. Mihail.
Symposium on Theory of Computing (STOC), pages 26–38, 1992.
- Fedorov  V. Fedorov. Theory of optimal experiments. Preprint 7 lsm, Moscow State University, 1969.
- Fedorov  V. Fedorov. Theory of optimal experiments. Academic Press, 1972.
- Frieze et al.  A. Frieze, N. Goyal, L. Rademacher, and S. Vempala. Expanders via random spanning trees. SIAM Journal on Computing, 43(2):497–513, 2014.
- Gharan et al.  S. O. Gharan, A. Saberi, and M. Singh. A randomized rounding approach to the traveling salesman problem. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 550–559, 2011.
- Halko et al.  N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.
- Joshi and Boyd  S. Joshi and S. Boyd. Sensor selection via convex optimization. IEEE Transactions on Signal Processing, pages 451–462, 2009.
Kulesza and Taskar 
A. Kulesza and B. Taskar.
Determinantal Point Processes for machine learning, volume 5. Foundations and Trends in Machine Learning, 2012.
- Li et al. [2016a] C. Li, S. Jegelka, and S. Sra. Fast mixing markov chains for strongly Rayleigh measures, DPPs, and constrained sampling. In Advances in Neural Information Processing Systems (NIPS), 2016a.
- Li et al. [2016b] C. Li, S. Sra, and S. Jegelka. Gaussian quadrature for matrix inverse forms with applications. In ICML, pages 1766–1775, 2016b.
- Lyons  R. Lyons. Determinantal probability measures. Publications Mathématiques de l’Institut des Hautes Études Scientifiques, 98(1):167–212, 2003.
- Ma et al.  P. Ma, M. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. In Journal of Machine Learning Research (JMLR), 2015.
- Macchi  O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1), 1975.
Magen and Zouzias 
A. Magen and A. Zouzias.
Near optimal dimensionality reductions that preserve volumes.
Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 523–534. Springer, 2008.
- Miller and Nguyen  A. J. Miller and N.-K. Nguyen. A fedorov exchange algorithm for d-optimal design. Journal of the royal statistical society, 1994.
- Morris and Mitchell  M. D. Morris and T. J. Mitchell. Exploratory designs for computational experiments. Journal of Statistical Planning and Inference, 43:381–402, 1995.
- Nguyen and Miller  N.-K. Nguyen and A. J. Miller. A review of some exchange algorithms for constructng discrete optimal designs. Computational Statistics and Data Analysis, 14:489–498, 1992.
- Pemantle  R. Pemantle. Towards a theory of negative dependence. Journal of Mathematical Physics, 41:1371–1390, 2000.
- Pemantle and Peres  R. Pemantle and Y. Peres. Concentration of Lipschitz functionals of determinantal and other strong Rayleigh measures. Combinatorics, Probability and Computing, 23:140–160, 2014.
- Pukelsheim  F. Pukelsheim. Optimal design of experiments. SIAM, 2006.
- Spielman and Srivastava  D. Spielman and N. Srivastava. Graph sparsification by effective resistances. SIAM J. Comput., 40(6):1913–1926, 2011.
- Spielman and Teng  D. A. Spielman and S.-H. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC, 2004.
- Tsitsvero et al.  M. Tsitsvero, S. Barbarossa, and P. D. Lorenzo. Signals on graphs: Uncertainty principle and sampling. IEEE Transactions on Signal Processing, 64(18):4845–4860, 2016.
- Wagner  D. Wagner. Multivariate stable polynomials: theory and applications. Bulletin of the American Mathematical Society, 48(1):53–84, 2011.
- Wang et al.  Y. Wang, A. W. Yu, and A. Singh. On Computationally Tractable Selection of Experiments in Regression Models. ArXiv e-prints, 2016.
- Zhao et al.  Y. Zhao, F. Pasqualetti, and J. Cortés. Scheduling of control nodes for improved network controllability. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1859–1864, 2016.
- Zhu et al.  R. Zhu, P. Ma, M. W. Mahoney, and B. Yu. Optimal subsampling approaches for large sample linear regression. arXiv preprint arXiv:1509.05111, 2015.
Appendix A Partition Function
We recall two easily verified facts about determinants that will be useful in our analysis:
The first one is known as matrix determinant lemma.
The partition function of , happens to have a pleasant closed-form formula. Although this formula is known , and follows immediately by an application of the Cauchy-Binet identity, we present an alternative proof based on the perturbation argument for its conceptual value and subsequent use.
Theorem 11 (Partition Function ).
Given where and , we have
First note that for and any , by (A.2) we have
Taking limits as on both sides we have
Let us focus on
. We construct an identity matrix, then we have
In other words, this value is proportional to the probability of sampling columns from using volume sampling. Therefore, using the definition of we have
Now taking the limit as we obtain
Appendix B Marginal Probability
The marginal probability of a set for dual volume sampling is
Theorem 11 shows how to compute the denominator, thus our main effort is devoted to the nominator. We have
Using the -trick we have
By decomposing we have
Now we let be the singular value decomposition of where , and . Plugging the decomposition in the equation we obtain