1 Introduction
A variety of applications share the core task of selecting a subset of columns from a short, wide matrix A ∈ R^{n×m} with n rows and m ≥ n columns. The criteria for selecting these columns typically aim at preserving information about the span of A while generating a well-conditioned submatrix. Classical and recent examples include experimental design, where we select observations or experiments [38]; preconditioning for solving linear systems and constructing low-stretch spanning trees (here A is a version of the node-edge incidence matrix and we select edges in a graph) [6, 4]; matrix approximation [11, 13, 24]; feature selection in k-means clustering [10, 12]; sensor selection [25]; and graph signal processing [14, 41]. In this work, we study a randomized approach that holds promise for all of these applications. This approach relies on sampling columns of A
according to a probability distribution defined over its submatrices: the probability of selecting a set S of k columns from A, with n ≤ k ≤ m, is

(1.1)   P(S; A) ∝ det(A_S A_S^T),

where A_S ∈ R^{n×k} is the submatrix consisting of the selected columns. This distribution is reminiscent of volume sampling, where k ≤ n columns are selected with probability proportional to det(A_S^T A_S), i.e., the squared volume of the parallelepiped spanned by the selected columns. (Volume sampling does not apply to k > n as the involved determinants vanish.) In contrast, P(S; A) uses the determinant of an n × n matrix and the volume spanned by the rows of the submatrix formed by the selected columns. Hence we refer to sampling from P(S; A) as dual volume sampling (DVS).
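The definition (1.1) can be made concrete with a short numerical sketch. The following brute-force enumeration (a hypothetical helper of our own, feasible only for small m; the contribution of this paper is precisely avoiding such enumeration) computes the DVS probabilities of all k-subsets with NumPy:

```python
import itertools
import numpy as np

def dvs_probabilities(A, k):
    """Brute-force DVS distribution: P(S) proportional to det(A_S A_S^T)
    over all column subsets S of size k (only feasible for small m)."""
    n, m = A.shape
    assert n <= k <= m
    sets = list(itertools.combinations(range(m), k))
    weights = np.array([np.linalg.det(A[:, list(S)] @ A[:, list(S)].T)
                        for S in sets])
    return sets, weights / weights.sum()

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 5))          # short, wide: n = 2 rows, m = 5 columns
sets, probs = dvs_probabilities(A, k=3)  # all 10 subsets of size 3
```

Each weight det(A_S A_S^T) is the Gram determinant of the selected columns' rows, so it is nonnegative, and the normalization yields a valid distribution.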
Contributions.
Despite the ostensible similarity between volume sampling and DVS, and despite the many practical implications of DVS outlined below, efficient algorithms for DVS are not known and were raised as open questions in [6]. In this work, we make two key contributions:

We develop polynomial-time randomized sampling algorithms for DVS, together with their derandomization. Surprisingly, our proofs require only elementary (but involved) matrix manipulations.

We establish that P(S; A) is a Strongly Rayleigh (SR) measure [8], a remarkable property that captures a specific form of negative dependence. Our proof relies on the theory of real stable polynomials, and the ensuing result implies a provably fast-mixing, practical MCMC sampler. Moreover, this result implies concentration properties for dual volume sampling.
In parallel with our work, [16] also proposed a polynomial-time sampling algorithm that works efficiently in practice. Our work goes further and uncovers the hitherto unknown Strongly Rayleigh property of DVS, which has important consequences, including those noted above.
1.1 Connections and implications.
The selection of columns from a short and wide matrix has many applications. Our algorithms for DVS hence have several implications and connections; we note a few below.
Experimental design. The theory of optimal experimental design explores several criteria for selecting the set of columns (experiments) S. Popular choices are the A- and E-optimality criteria

(1.2)   min_{|S|=k} ||A_S^+||_F^2   (A-optimal),    min_{|S|=k} ||A_S^+||_2^2   (E-optimal).

Here, A_S^+ denotes the Moore-Penrose pseudoinverse of A_S, and the minimization ranges over all S such that A_S has full row rank n. A-optimal design, for instance, is statistically optimal for linear regression
[38]. Finding an optimal solution for these design problems is NP-hard, and most discrete algorithms use local search [33]. Avron and Boutsidis [6, Theorem 3.1] show that dual volume sampling yields an approximation guarantee for both A- and E-optimal design: if S is sampled from (1.1), then

(1.3)   E[ ||A_S^+||_F^2 ] ≤ (m − n + 1)/(k − n + 1) · ||A^+||_F^2,

with an analogous bound for the spectral norm.
Avron and Boutsidis [6] provide a polynomial-time sampling algorithm only for the case k = n. Our algorithms achieve the bound (1.3) in expectation, and the derandomization in Section 2.3 achieves the bound deterministically. Wang et al. [43] recently (in parallel) achieved approximation bounds for A-optimality via a different algorithm combining a convex relaxation and a greedy method. Other methods include leverage score sampling [30] and predictive length sampling [45].
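On a tiny instance, the Frobenius-norm guarantee of (1.3) can be checked exactly by computing the DVS expectation by brute force. The instance sizes below are arbitrary, and the bound (m − n + 1)/(k − n + 1) · ||A^+||_F^2 is the one stated above:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 2, 6, 3
A = rng.standard_normal((n, m))

subsets = [list(S) for S in itertools.combinations(range(m), k)]
w = np.array([np.linalg.det(A[:, S] @ A[:, S].T) for S in subsets])
p = w / w.sum()                          # exact DVS probabilities

# exact DVS expectation of the A-optimality objective ||A_S^+||_F^2
fro = np.array([np.linalg.norm(np.linalg.pinv(A[:, S]), 'fro')**2
                for S in subsets])
expected = p @ fro

bound = (m - n + 1) / (k - n + 1) * np.linalg.norm(np.linalg.pinv(A), 'fro')**2
```

Since the expectation here is computed exactly (no sampling), the comparison with the bound is deterministic.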
Low-stretch spanning trees and applications. The objectives in (1.2) also arise in the construction of low-stretch spanning trees, which have important applications in graph sparsification, preconditioning and solving symmetric diagonally dominant (SDD) linear systems [40], among others [18]. Here A is a version of the node-edge incidence matrix of an undirected graph with n nodes and m edges, and the stretch of a spanning tree can be expressed in terms of the singular values of the corresponding column submatrix [6]. In those applications, we hence search for a set of edges forming a tree with low stretch.
Network controllability. The problem of sampling columns of a matrix also arises in network controllability. For example, Zhao et al. [44] consider selecting control nodes (under certain constraints) over time in complex networks to control a linear time-invariant network. After transforming the problem into a column subset selection problem on a short, wide controllability matrix, the objective becomes essentially an E-optimal design problem, for which the authors use greedy heuristics.
Notation.
From a matrix A ∈ R^{n×m} with m columns, we sample a set S of k columns (n ≤ k ≤ m), where S ⊆ [m] := {1, …, m}. We denote the singular values of A_S by σ_j(S), in decreasing order. We will assume A has full row rank n, so σ_n(A) > 0. We also assume that A_S has full row rank n for every S ⊆ [m] with |S| ≥ n. By e_j, we denote the j-th elementary symmetric polynomial, i.e., up to sign the j-th coefficient of a characteristic polynomial.

2 Polynomial-time Dual Volume Sampling
We describe in this section our method to sample from the distribution P(S; A). Our first method relies on the key insight that, as we show, the marginal probabilities for DVS can be computed in polynomial time. To demonstrate this, we begin with the partition function and then derive the marginals.
2.1 Marginals
The partition function has a conveniently simple closed form, which follows from the Cauchy-Binet formula and was also derived in [6].
Lemma 1 (Partition Function [6]).
For A ∈ R^{n×m} with n ≤ m and n ≤ k ≤ m, we have

Σ_{S ⊆ [m], |S| = k} det(A_S A_S^T) = C(m−n, k−n) det(A A^T),

where C(·, ·) denotes the binomial coefficient.
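Assuming the closed form Σ_{|S|=k} det(A_S A_S^T) = C(m−n, k−n) det(A A^T) stated above, a brute-force check on a random instance looks as follows:

```python
import itertools
from math import comb
import numpy as np

rng = np.random.default_rng(2)
n, m, k = 3, 7, 5
A = rng.standard_normal((n, m))

# brute-force partition function: sum of det(A_S A_S^T) over all k-subsets S
Z = sum(np.linalg.det(A[:, list(S)] @ A[:, list(S)].T)
        for S in itertools.combinations(range(m), k))

closed_form = comb(m - n, k - n) * np.linalg.det(A @ A.T)
```

The identity also follows from Cauchy-Binet: each det(A_S A_S^T) expands into Σ_{T ⊆ S, |T| = n} det(A_T)^2, and each n-set T is contained in exactly C(m−n, k−n) of the k-sets S.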
Next, we will need the marginal probability that a given set T is a subset of the random set S. In the following theorem, T^c denotes the (set) complement of T, and ⊥ denotes the orthogonal complement of a subspace.
Theorem 2 (Marginals).
Let , , and . Let
be the singular value decomposition of
where , and . Further define the matrices. Let
be the eigenvalue decomposition of
where . Moreover, let and . Then the marginal probability of T in DVS follows in closed form from these quantities. We prove Theorem 2 via a perturbation argument that connects DVS to volume sampling. Specifically, observe that for A_S ∈ R^{n×k} and α > 0 it holds that

(2.1)   det(α^2 I_k + A_S^T A_S) = α^{2(k−n)} det(α^2 I_n + A_S A_S^T).

Carefully letting α → 0 bridges volumes with "dual" volumes. The technical remainder of the proof further relates this equality to singular values, and exploits properties of characteristic polynomials. A similar argument yields an alternative proof of Lemma 1. We give the detailed proofs in Appendices A and B, respectively.
Complexity.
The numerator of the marginal in Theorem 2 consists of three terms, each computable in polynomial time, and the denominator is the partition function of Lemma 1; in total, the marginal probability can be computed in polynomial time.
2.2 Sampling
The marginal probabilities derived above directly yield a polynomial-time exact DVS algorithm. Instead of sets, we sample ordered tuples of k columns, and denote the resulting tuple variant of the DVS distribution accordingly. Sampling is now straightforward: at the i-th step, we sample the i-th column conditioned on the columns already selected; these conditional probabilities are easily obtained from the marginals in Theorem 2.
Corollary 3.
Let , and as in Theorem 2. Then,
As a result, it is possible to draw an exact dual volume sample in polynomial time.
The full proof may be found in the appendix. The running time claim follows since the sampling algorithm invokes O(km) computations of marginal probabilities, each of which takes polynomial time.
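A minimal sketch of this sequential sampler, with brute-force unnormalized marginals standing in for the closed-form marginals of Theorem 2 (so it runs in exponential, not polynomial, time; the helper names are ours):

```python
import itertools
import numpy as np

def dvs_weight(A, S):
    S = sorted(S)
    return np.linalg.det(A[:, S] @ A[:, S].T)

def marginal_weight(A, R, k):
    """Unnormalized marginal: sum of det(A_S A_S^T) over k-sets S containing R."""
    m = A.shape[1]
    rest = [j for j in range(m) if j not in R]
    return sum(dvs_weight(A, set(R) | set(C))
               for C in itertools.combinations(rest, k - len(R)))

def sample_dvs(A, k, rng):
    """Draw one DVS set column by column: the next column is chosen with
    probability proportional to the marginal weight of the extended prefix."""
    m = A.shape[1]
    R = []
    for _ in range(k):
        cand = [j for j in range(m) if j not in R]
        w = np.array([marginal_weight(A, R + [j], k) for j in cand])
        R.append(cand[rng.choice(len(cand), p=w / w.sum())])
    return frozenset(R)

rng = np.random.default_rng(3)
A = rng.standard_normal((2, 5))
S = sample_dvs(A, k=3, rng=rng)
```

One sanity check of the conditionals: since every k-set containing a prefix R contains k − |R| one-column extensions of R, the extension weights must sum to (k − |R|) times the weight of R, so the per-step probabilities are correctly normalized.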
Remark
A potentially more efficient approximate algorithm can be derived by noting the relation between volume sampling and DVS. Specifically, we add a small perturbation to DVS as in (2.1) to transform it into a volume sampling problem, and apply random projections for more efficient volume sampling as in [17]. Please refer to Appendix C for more details.
2.3 Derandomization
Next, we derandomize the above sampling algorithm to deterministically select a subset that satisfies the bound (1.3) for the Frobenius norm, thereby answering another question in [6]. The key insight for derandomization is that conditional expectations can be computed in polynomial time, given the marginals in Theorem 2:
Corollary 4.
Let be such that the marginal distribution satisfies . The conditional expectation can be expressed as
where the weights involved are unnormalized marginal probabilities, and it can be computed in polynomial time.
We show the full derivation in Appendix D.
Corollary 4 enables a greedy derandomization procedure. Starting with the empty tuple, in the i-th iteration we greedily select the column that minimizes the conditional expectation and append it to our selection. The final set is the unordered version of the resulting tuple. Theorem 5 shows that this greedy procedure succeeds, and implies a deterministic version of the bound (1.3).
Theorem 5.
The greedy derandomization selects a column set S satisfying

||A_S^+||_F^2 ≤ (m − n + 1)/(k − n + 1) · ||A^+||_F^2,

together with an analogous (slightly weaker) bound for the spectral norm.
In the proof, we construct a greedy algorithm. In each iteration, the algorithm computes, for each column that has not yet been selected, the expectation conditioned on this column being included in the current set. It then adds the column with the lowest conditional expectation to the current set. This greedy inclusion of elements can only decrease the conditional expectation, thus retaining the bound in Theorem 5. The detailed proof is deferred to Appendix E.
Complexity. Each iteration of the greedy selection requires computing O(m) conditional expectations, so the total running time over the k iterations is polynomial. The approximation bound for the spectral norm is slightly worse than that in (1.3), but is of the same order in common regimes.
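The greedy derandomization can likewise be sketched with brute-force conditional expectations in place of the polynomial-time formulas of Corollary 4 (function names are ours; only illustrative for small instances):

```python
import itertools
import numpy as np

def cond_expectation(A, R, k):
    """E[||A_S^+||_F^2 | R subset of S] under DVS, by brute-force enumeration."""
    m = A.shape[1]
    rest = [j for j in range(m) if j not in R]
    num = den = 0.0
    for C in itertools.combinations(rest, k - len(R)):
        S = sorted(set(R) | set(C))
        w = np.linalg.det(A[:, S] @ A[:, S].T)
        num += w * np.linalg.norm(np.linalg.pinv(A[:, S]), 'fro')**2
        den += w
    return num / den

def greedy_dvs(A, k):
    """Greedily add the column with the smallest conditional expectation."""
    m = A.shape[1]
    R = []
    for _ in range(k):
        cand = [j for j in range(m) if j not in R]
        R.append(min(cand, key=lambda j: cond_expectation(A, R + [j], k)))
    return sorted(R)

rng = np.random.default_rng(4)
n, m, k = 2, 6, 3
A = rng.standard_normal((n, m))
S = greedy_dvs(A, k)
```

Since the conditional expectations given the one-column extensions average to the current conditional expectation, the minimizing column can only decrease it; hence the final objective value is at most the overall DVS expectation, which is bounded as in (1.3).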
3 Strong Rayleigh Property and Fast Markov Chain Sampling
Next, we investigate DVS more deeply and discover that it possesses a remarkable structural property, namely the Strongly Rayleigh (SR) [8] property. This property has proved remarkably fruitful in a variety of recent contexts, including progress in approximation algorithms [23], fast sampling [2, 27], graph sparsification [22, 39], extensions to the Kadison-Singer problem [1], and certain concentration of measure results [37], among others.
For DVS, the SR property has two major consequences: it leads to a fast-mixing, practical MCMC sampler, and it implies results on concentration of measure.
Strongly Rayleigh measures.
SR measures were introduced in the landmark paper of Borcea et al. [8], who develop a rich theory of negatively associated measures. In particular, we say that a probability measure μ on subsets of [m] is negatively associated if E[F G] ≤ E[F] E[G] for increasing functions F, G with disjoint support. This property reflects a "repelling" nature of μ, a property that occurs more broadly across probability, combinatorics, physics, and other fields; see [36, 8, 42] and references therein. The negative association property turns out to be quite subtle in general; the class of SR measures captures a strong notion of negative association and provides a framework for analyzing such measures.
Specifically, SR measures are defined via their connection to real stable polynomials [36, 8, 42]. A multivariate polynomial f ∈ R[z_1, …, z_m] is called real stable if all its coefficients are real and f(z_1, …, z_m) ≠ 0 whenever Im(z_i) > 0 for all 1 ≤ i ≤ m. A measure μ is called an SR measure if its multivariate generating polynomial f_μ(z) = Σ_{S ⊆ [m]} μ(S) Π_{i∈S} z_i is real stable. Notable examples of SR measures are Determinantal Point Processes [31, 29, 9, 26], balanced matroids [19, 37], and Bernoullis conditioned on their sum, among others. It is known (see [8, pg. 523]) that the class of SR measures is exponentially larger than the class of determinantal measures.
3.1 Strong Rayleigh Property of DVS
Theorem 6 establishes the SR property for DVS and is the main result of this section. Here and in the following, we use the notation z^S := Π_{i∈S} z_i.
Theorem 6.
Let A ∈ R^{n×m} and n ≤ k ≤ m. Then the multiaffine polynomial

(3.1)   p(z) = Σ_{S ⊆ [m], |S| = k} det(A_S A_S^T) z^S

is real stable. Consequently, P(S; A) is an SR measure.
The proof of Theorem 6 relies on key properties of real stable polynomials and SR measures established in [8]. Essentially, the proof demonstrates that the generating polynomial of can be obtained by applying a few carefully chosen stability preserving operations to a polynomial that we know to be real stable. Stability, although easily destroyed, is closed under several operations noted in the important proposition below.
Proposition 7 (Prop. 2.1 [8]).
Let f : C^m → C be a stable polynomial. The following operations preserve stability: (i) Substitution: f(a, z_2, …, z_m) for a ∈ R; (ii) Differentiation: ∂f/∂z_i for any 1 ≤ i ≤ m; (iii) Diagonalization: f(z, z, z_3, …, z_m) is stable, and hence so is f(z, z, …, z); and (iv) Inversion: z_1^d f(−z_1^{−1}, z_2, …, z_m), where d = deg_{z_1}(f).
In addition, we need the following two propositions for proving Theorem 6.
Proposition 8 (Prop. 2.4 [7]).
Let B be a Hermitian matrix and A_1, …, A_m be Hermitian positive semidefinite matrices. Then, the following polynomial is stable:

(3.2)   f(z_1, …, z_m) = det(B + z_1 A_1 + ⋯ + z_m A_m).
Proposition 9.
For A ∈ R^{n×m} and S ⊆ [m] with |S| ≥ n, we have det(A_S A_S^T) = Σ_{T ⊆ S, |T| = n} det(A_T)^2.
Proof.
Let Z = Diag(z_1, …, z_m) be a diagonal matrix. Using the Cauchy-Binet identity we have

det(A Z A^T) = Σ_{T ⊆ [m], |T| = n} det(A_T)^2 Π_{i∈T} z_i.

Thus, when Z = I_S, the (diagonal) indicator matrix for S, we obtain det(A I_S A^T) = det(A_S A_S^T). Consequently, in the summation above only terms with T ⊆ S survive, yielding det(A_S A_S^T) = Σ_{T ⊆ S, |T| = n} det(A_T)^2. ∎
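The two identities used in this proof, Cauchy-Binet with a diagonal matrix and its specialization to an indicator matrix, are easy to verify numerically:

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
n, m = 3, 6
A = rng.standard_normal((n, m))
z = rng.random(m)

# Cauchy-Binet with a diagonal matrix:
# det(A diag(z) A^T) = sum over n-subsets T of (prod_{i in T} z_i) det(A_T)^2
lhs = np.linalg.det(A @ np.diag(z) @ A.T)
rhs = sum(np.prod(z[list(T)]) * np.linalg.det(A[:, list(T)])**2
          for T in itertools.combinations(range(m), n))

# Specializing z to the indicator of S: det(A_S A_S^T) = sum_{T in S} det(A_T)^2
S = [0, 2, 3, 5]
lhs2 = np.linalg.det(A[:, S] @ A[:, S].T)
rhs2 = sum(np.linalg.det(A[:, list(T)])**2
           for T in itertools.combinations(S, n))
```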
We are now ready to sketch the proof of Theorem 6.
Proof (Theorem 6).
Notationally, it is more convenient to prove that the “complement” polynomial is stable; subsequently, an application of Prop. 7(iv) yields stability of (3.1). Using matrix notation , , our starting stable polynomial (this stability follows from Prop. 8) is
which can be expanded as
Thus, is real stable in variables, indexed below by and where . Instead of the form above, we can sum over but then have to constrain the support to the case when and . In other words, we may write (using Iverson brackets)
(3.3) 
Next, we truncate polynomial (3.3) at degree by restricting . By [8, Corollary 4.18] this truncation preserves stability, whence
is also stable. Using Prop. 7(iii), setting retains stability; thus
is also stable. Next, differentiating , times with respect to and evaluating at preserves stability (Prop. 7(ii) and (i)). In doing so, only terms corresponding to survive, resulting in
which is just (up to a constant); here, the last equality follows from Prop. 9. This establishes stability of and hence of . Since is in addition multiaffine, it is the generating polynomial of an SR measure, completing the proof. ∎
3.2 Implications: MCMC
The SR property of P(S; A) established in Theorem 6 implies a fast-mixing Markov chain for sampling from it. The states of the Markov chain are all column sets of cardinality k. The chain starts with a randomly initialized active set S, and in each iteration we propose swapping an element s ∈ S with an element t ∉ S, accepting with a probability determined by the current and proposed sets. The stationary distribution of this chain is the one induced by DVS, by a simple detailed-balance argument. The chain is shown in Algorithm 1.
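A sketch of such an exchange chain (our own minimal variant of Algorithm 1, with a lazy step and Metropolis acceptance), together with a numerical check that the DVS distribution is indeed stationary for the induced transition kernel:

```python
import itertools
import numpy as np

def gram_det(A, S):
    S = sorted(S)
    return np.linalg.det(A[:, S] @ A[:, S].T)

def mcmc_dvs_step(A, S, rng):
    """One lazy Metropolis swap step (a minimal sketch of Algorithm 1)."""
    m = A.shape[1]
    if rng.random() < 0.5:                       # lazy step: stay put
        return S
    S = sorted(S)
    u = S[rng.integers(len(S))]                  # element to swap out
    out = [j for j in range(m) if j not in S]
    v = out[rng.integers(len(out))]              # element to swap in
    Snew = (set(S) - {u}) | {v}
    if rng.random() < min(1.0, gram_det(A, Snew) / gram_det(A, S)):
        return frozenset(Snew)
    return frozenset(S)

def transition_matrix(A, k):
    """Explicit kernel of the chain, for checking stationarity on tiny instances."""
    m = A.shape[1]
    states = [frozenset(c) for c in itertools.combinations(range(m), k)]
    idx = {S: i for i, S in enumerate(states)}
    w = np.array([gram_det(A, S) for S in states])
    P = np.zeros((len(states), len(states)))
    for a, S in enumerate(states):
        for u in S:
            for v in set(range(m)) - S:
                b = idx[(S - {u}) | {v}]
                P[a, b] = min(1.0, w[b] / w[a]) / (2 * k * (m - k))
        P[a, a] = 1.0 - P[a].sum()               # lazy self-loop mass
    return states, w / w.sum(), P

rng = np.random.default_rng(6)
A = rng.standard_normal((2, 5))
states, pi, P = transition_matrix(A, k=3)
```

Since the proposal is symmetric and acceptance is min(1, w(S')/w(S)), detailed balance π(S) P(S, S') = π(S') P(S', S) holds, which is exactly what the stationarity check verifies.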
The convergence of the Markov chain is measured via its mixing time: the mixing time of the chain indicates the number of iterations t that we must perform (starting from S_0) before we can consider S_t an approximately valid sample from P(S; A). Formally, if δ(t) is the total variation distance between the distribution of S_t and P(S; A) after t steps, then

τ(ε) := min{ t : δ(t') ≤ ε for all t' ≥ t }

is the mixing time to sample from a distribution ε-close to P(S; A) in terms of total variation distance. We say that the chain mixes fast if τ(ε) is polynomial in the problem size.
The fast mixing result for Algorithm 1 is a corollary of Theorem 6 combined with a recent result of [3] on fastmixing Markov chains for homogeneous SR measures. Theorem 10 states this precisely.
Theorem 10 (Mixing time).
The Markov chain shown in Algorithm 1 mixes fast: its mixing time τ(ε) is bounded by a polynomial in k and m times log(1/(ε P(S_0; A))), where S_0 is the initial set.
Proof. This follows directly from Theorem 6 combined with the mixing-time bound of [3] for homogeneous Strongly Rayleigh measures. ∎
Implementation.
To implement Algorithm 1 we need to compute the transition probabilities. Let M = A_S A_S^T for the current set S, and assume the proposed move swaps column u ∈ S for column v ∉ S. By the matrix determinant lemma we have the acceptance ratio

det(M + a_v a_v^T − a_u a_u^T) / det(M) = (1 + a_v^T M^{−1} a_v)(1 − a_u^T (M + a_v a_v^T)^{−1} a_u),

where a_u, a_v denote the corresponding columns of A. Thus, the transition probabilities can be computed in O(n^2) time per step once M^{−1} is maintained under rank-one updates. Moreover, one can further accelerate this algorithm by using the quadrature techniques of [28] to compute lower and upper bounds on this acceptance ratio, determining early acceptance or rejection of the proposed move.
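Computing the acceptance ratio via two rank-one applications of the matrix determinant lemma can be verified against the direct determinant ratio (the indices below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 3, 8
A = rng.standard_normal((n, m))

S = [0, 1, 2, 3, 4]                  # current set (k = 5)
u, v = 2, 6                          # swap column u out, column v in
Snew = [0, 1, 3, 4, 6]
a_u, a_v = A[:, u], A[:, v]
M = A[:, S] @ A[:, S].T

# direct determinant ratio det(A_S' A_S'^T) / det(A_S A_S^T)
direct = np.linalg.det(A[:, Snew] @ A[:, Snew].T) / np.linalg.det(M)

# two applications of the matrix determinant lemma:
# first add a_v a_v^T, then subtract a_u a_u^T
M1 = M + np.outer(a_v, a_v)
fast = (1 + a_v @ np.linalg.solve(M, a_v)) * (1 - a_u @ np.linalg.solve(M1, a_u))
```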
Initialization.
A remaining question is initialization. Since the mixing time involves log(1/P(S_0; A)), we need to start with S_0 such that P(S_0; A) is sufficiently bounded away from 0. We show in Appendix F that a simple greedy algorithm yields such an initialization, and that the resulting overall running time of Algorithm 1 is then linear in the size of the data set, hence efficient when n and k are not too large.
3.3 Further implications and connections
Concentration. As noted above, the SR property implies concentration of measure for dual volume sampling: by the results of Pemantle and Peres [37], Lipschitz functionals of Strongly Rayleigh measures concentrate, and these results therefore apply to P(S; A).
Algorithms for experimental design.
Widely used classical algorithms for finding an approximately optimal design include Fedorov's exchange algorithm [20, 21] (a greedy local search) and simulated annealing [34]. Both methods start with a random initial set S, and greedily or randomly exchange a column in S with a column outside S. Despite their very expensive running times, they are known to work well in practice [35, 43]. Yet, so far there is no theoretical analysis, nor a principled way of determining when to stop the greedy search.
Curiously, our MCMC sampler is essentially a randomized version of Fedorov's exchange method. The two methods can be connected by a unified simulated annealing view, where we sample from a distribution proportional to P(S; A)^{1/T} with temperature parameter T. Driving T to zero essentially recovers Fedorov's method, while our results imply fast mixing for T = 1, together with approximation guarantees. Through this lens, simulated annealing may be viewed as initializing Fedorov's method with the fast-mixing sampler. In practice, we observe that letting T < 1 improves the approximation results, which opens interesting questions for future work.
4 Experiments
We report selection performance of DVS on real regression data (CompAct, CompAct(s), Abalone and Bank32NH^1) for experimental design. We use 4,000 samples from each dataset for estimation. We compare against various baselines, including uniform sampling (Unif), leverage score sampling (Lev) [30], predictive length sampling (PL) [45], the sampling (Smpl) and greedy (Greedy) selection methods in [43], and Fedorov's exchange algorithm [20]. We initialize the MCMC sampler with Kmeans++ [5] for DVS and run it for 10,000 iterations, which empirically yields selections that are sufficiently good. We measure performance via (1) the prediction error and (2) running time. Figure 1 shows the results for these measures with varying sample sizes k. Further experiments (including those for the interpolation discussed above) may be found in the appendix.
^1 http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html
In terms of prediction error, DVS performs well and is comparable with Lev. Its strength compared to the greedy and relaxation methods (Smpl, Greedy, Fedorov) is running time, leading to good time-error tradeoffs. These tradeoffs are also illustrated in Figure 1.
In other experiments (shown in Appendix G) we observed that in some cases, the optimization and greedy methods (Smpl, Greedy, Fedorov) yield better results than sampling, however with much higher running times. Hence, given timeerror tradeoffs, DVS may be an interesting alternative in situations where time is a very limited resource and results are needed quickly.
5 Conclusion
In this paper, we study the problem of DVS and develop an exact (randomized) polynomial time sampling algorithm as well as its derandomization. We further study dual volume sampling via the theory of realstable polynomials and prove that its distribution satisfies the “Strong Rayleigh” property. This result has remarkable consequences, especially because it implies a provably fastmixing Markov chain sampler that makes dual volume sampling much more attractive to practitioners. Finally, we observe connections to classical, computationally more expensive experimental design methods (Fedorov’s method and SA); together with our results here, these could be a first step towards a better theoretical understanding of those methods.
Acknowledgement
This research was supported by NSF CAREER award 1553284, NSF grant IIS1409802, DARPA grant N660011714039, DARPA FunLoL grant (W911NF1610551) and a Siebel Scholar Fellowship. The views, opinions, and/or findings contained in this article are those of the author and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense.
References
 Anari and Gharan [2014] N. Anari and S. O. Gharan. The KadisonSinger problem for strongly Rayleigh measures and applications to asymmetric TSP. arXiv:1412.1143, 2014.
 Anari and Gharan [2015] N. Anari and S. O. Gharan. Effectiveresistancereducing flows and asymmetric TSP. In IEEE Symposium on Foundations of Computer Science (FOCS), 2015.
 Anari et al. [2016] N. Anari, S. O. Gharan, and A. Rezaei. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In COLT, pages 23–26, 2016.
 Arioli and Duff [2015] M. Arioli and I. S. Duff. Preconditioning of linear leastsquares problems by identifying basic variables. SIAM J. Sci. Comput., 2015.
 Arthur and Vassilvitskii [2007] D. Arthur and S. Vassilvitskii. kmeans++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACMSIAM symposium on Discrete algorithms, 2007.
 Avron and Boutsidis [2013] H. Avron and C. Boutsidis. Faster subset selection for matrices and applications. SIAM Journal on Matrix Analysis and Applications, 34(4):1464–1499, 2013.
 Borcea and Brändén [2008] J. Borcea and P. Brändén. Applications of stable polynomials to mixed determinants: Johnson’s conjectures, unimodality, and symmetrized Fischer products. Duke Mathematical Journal, pages 205–223, 2008.
 Borcea et al. [2009] J. Borcea, P. Brändén, and T. Liggett. Negative dependence and the geometry of polynomials. Journal of the American Mathematical Society, 22:521–567, 2009.
 Borodin [2009] A. Borodin. Determinantal point processes. arXiv:0911.1153, 2009.
 Boutsidis and MagdonIsmail [2013] C. Boutsidis and M. MagdonIsmail. Deterministic feature selection for kmeans clustering. IEEE Transactions on Information Theory, pages 6099–6110, 2013.
 Boutsidis et al. [2009] C. Boutsidis, M. W. Mahoney, and P. Drineas. An improved approximation algorithm for the column subset selection problem. In SODA, pages 968–977, 2009.
 Boutsidis et al. [2011] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas. Stochastic dimensionality reduction for kmeans clustering. arXiv preprint arXiv:1110.2897, 2011.
 Boutsidis et al. [2014] C. Boutsidis, P. Drineas, and M. MagdonIsmail. Nearoptimal columnbased matrix reconstruction. SIAM Journal on Computing, pages 687–717, 2014.
 Chen et al. [2015] S. Chen, R. Varma, A. Sandryhaila, and J. Kovačević. Discrete signal processing on graphs: Sampling theory. IEEE Transactions on Signal Processing, 63(24):6510–6523, 2015.
 Çivril and MagdonIsmail [2009] A. Çivril and M. MagdonIsmail. On selecting a maximum volume submatrix of a matrix and related problems. Theoretical Computer Science, pages 4801–4811, 2009.
 Derezinski and Warmuth [2017] M. Derezinski and M. K. Warmuth. Unbiased estimates for linear regression via volume sampling. Advances in Neural Information Processing Systems (NIPS), 2017.
 Deshpande and Rademacher [2010] A. Deshpande and L. Rademacher. Efficient volume sampling for row/column subset selection. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 329–338. IEEE, 2010.
 Elkin et al. [2008] M. Elkin, Y. Emek, D. A. Spielman, and S.H. Teng. Lowerstretch spanning trees. SIAM Journal on Computing, 2008.

 Feder and Mihail [1992] T. Feder and M. Mihail. Balanced matroids. In Symposium on Theory of Computing (STOC), pages 26–38, 1992.
 Fedorov [1969] V. Fedorov. Theory of optimal experiments. Preprint 7 lsm, Moscow State University, 1969.
 Fedorov [1972] V. Fedorov. Theory of optimal experiments. Academic Press, 1972.
 Frieze et al. [2014] A. Frieze, N. Goyal, L. Rademacher, and S. Vempala. Expanders via random spanning trees. SIAM Journal on Computing, 43(2):497–513, 2014.
 Gharan et al. [2011] S. O. Gharan, A. Saberi, and M. Singh. A randomized rounding approach to the traveling salesman problem. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 550–559, 2011.
 Halko et al. [2011] N. Halko, P.G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.
 Joshi and Boyd [2009] S. Joshi and S. Boyd. Sensor selection via convex optimization. IEEE Transactions on Signal Processing, pages 451–462, 2009.

 Kulesza and Taskar [2012] A. Kulesza and B. Taskar. Determinantal Point Processes for Machine Learning, volume 5. Foundations and Trends in Machine Learning, 2012.
 Li et al. [2016a] C. Li, S. Jegelka, and S. Sra. Fast mixing Markov chains for strongly Rayleigh measures, DPPs, and constrained sampling. In Advances in Neural Information Processing Systems (NIPS), 2016a.
 Li et al. [2016b] C. Li, S. Sra, and S. Jegelka. Gaussian quadrature for matrix inverse forms with applications. In ICML, pages 1766–1775, 2016b.
 Lyons [2003] R. Lyons. Determinantal probability measures. Publications Mathématiques de l’Institut des Hautes Études Scientifiques, 98(1):167–212, 2003.
 Ma et al. [2015] P. Ma, M. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. In Journal of Machine Learning Research (JMLR), 2015.
 Macchi [1975] O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1), 1975.

 Magen and Zouzias [2008] A. Magen and A. Zouzias. Near optimal dimensionality reductions that preserve volumes. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 523–534. Springer, 2008.
 Miller and Nguyen [1994] A. J. Miller and N.-K. Nguyen. A Fedorov exchange algorithm for D-optimal design. Journal of the Royal Statistical Society, 1994.
 Morris and Mitchell [1995] M. D. Morris and T. J. Mitchell. Exploratory designs for computational experiments. Journal of Statistical Planning and Inference, 43:381–402, 1995.
 Nguyen and Miller [1992] N.-K. Nguyen and A. J. Miller. A review of some exchange algorithms for constructing discrete optimal designs. Computational Statistics and Data Analysis, 14:489–498, 1992.
 Pemantle [2000] R. Pemantle. Towards a theory of negative dependence. Journal of Mathematical Physics, 41:1371–1390, 2000.
 Pemantle and Peres [2014] R. Pemantle and Y. Peres. Concentration of Lipschitz functionals of determinantal and other strong Rayleigh measures. Combinatorics, Probability and Computing, 23:140–160, 2014.
 Pukelsheim [2006] F. Pukelsheim. Optimal design of experiments. SIAM, 2006.
 Spielman and Srivastava [2011] D. Spielman and N. Srivastava. Graph sparsification by effective resistances. SIAM J. Comput., 40(6):1913–1926, 2011.
 Spielman and Teng [2004] D. A. Spielman and S.H. Teng. Nearlylinear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC, 2004.
 Tsitsvero et al. [2016] M. Tsitsvero, S. Barbarossa, and P. D. Lorenzo. Signals on graphs: Uncertainty principle and sampling. IEEE Transactions on Signal Processing, 64(18):4845–4860, 2016.
 Wagner [2011] D. Wagner. Multivariate stable polynomials: theory and applications. Bulletin of the American Mathematical Society, 48(1):53–84, 2011.
 Wang et al. [2016] Y. Wang, A. W. Yu, and A. Singh. On Computationally Tractable Selection of Experiments in Regression Models. ArXiv eprints, 2016.
 Zhao et al. [2016] Y. Zhao, F. Pasqualetti, and J. Cortés. Scheduling of control nodes for improved network controllability. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1859–1864, 2016.
 Zhu et al. [2015] R. Zhu, P. Ma, M. W. Mahoney, and B. Yu. Optimal subsampling approaches for large sample linear regression. arXiv preprint arXiv:1509.05111, 2015.
Appendix A Partition Function
We recall two easily verified facts about determinants that will be useful in our analysis:
(A.1)   det(Z + u v^T) = det(Z) (1 + v^T Z^{−1} u) for an invertible matrix Z and vectors u, v;
(A.2)   det(I_n + X Y) = det(I_k + Y X) for X ∈ R^{n×k}, Y ∈ R^{k×n}.
The first one is known as the matrix determinant lemma; the second is Sylvester's determinant identity.
The partition function of P(S; A) happens to have a pleasant closed-form formula. Although this formula is known [6], and follows immediately by an application of the Cauchy-Binet identity, we present an alternative proof based on the perturbation argument, for its conceptual value and subsequent use.
Theorem 11 (Partition Function [6]).
Given A ∈ R^{n×m} where m ≥ n and n ≤ k ≤ m, we have

(A.3)   Σ_{S ⊆ [m], |S| = k} det(A_S A_S^T) = C(m−n, k−n) det(A A^T),

where C(·, ·) denotes the binomial coefficient.
Proof.
First note that for S with |S| = k and any α > 0, by (A.2) we have

det(α^2 I_n + A_S A_S^T) = α^{2(n−k)} det(α^2 I_k + A_S^T A_S).

Taking limits as α → 0 on both sides we have

det(A_S A_S^T) = lim_{α→0} α^{2(n−k)} det(α^2 I_k + A_S^T A_S).

Let us focus on det(α^2 I_k + A_S^T A_S). We construct the augmented matrix B = [A; α I_m] ∈ R^{(n+m)×m} by stacking A on top of a scaled identity matrix; then we have

(A.4)   B_S^T B_S = α^2 I_k + A_S^T A_S.

In other words, this value is proportional to the probability of sampling the k columns S from B using volume sampling. Therefore, using the normalization of volume sampling and the definition of elementary symmetric polynomials, we have

Σ_{S ⊆ [m], |S| = k} det(α^2 I_k + A_S^T A_S) = e_k(α^2 I_m + A^T A),

where e_k(·) denotes the k-th elementary symmetric polynomial of the eigenvalues. The matrix A^T A has eigenvalues σ_1^2, …, σ_n^2 and 0 with multiplicity m − n. Now taking the limit as α → 0 we obtain

Σ_{S ⊆ [m], |S| = k} det(A_S A_S^T) = lim_{α→0} α^{2(n−k)} e_k(α^2 I_m + A^T A) = C(m−n, k−n) Π_{i=1}^n σ_i^2 = C(m−n, k−n) det(A A^T).
∎
Appendix B Marginal Probability
Proof.
The marginal probability of a set T for dual volume sampling is

P(T ⊆ S) = (1/Z) Σ_{S ⊇ T, |S| = k} det(A_S A_S^T),

where Z is the partition function.
Theorem 11 shows how to compute the denominator, thus our main effort is devoted to the numerator. We have
Using the trick we have
By decomposing we have
Now we let be the singular value decomposition of where , and . Plugging the decomposition in the equation we obtain