The matrix scaling problem asks to scale each row and column of a given matrix by a positive number in such a way that the resulting matrix has marginals (i.e., row- and column-sums) that are close to some prescribed marginals. For example, one could ask to scale the matrix in such a way that it becomes doubly stochastic.
Matrix scaling has applications in a wide variety of areas including numerical linear algebra [ABB99]
, optimal transport in machine learning[Cut13], statistics [Kru37, DS40, Bro59, BFH75], and also in more theoretical settings, e.g. for approximating the permanent [LSW00]. For a survey, we refer the reader to [Ide16]
. Furthermore, the matrix scaling problem is a special (commutative) instance of a more general (non-commutative) class of problems, which includes operator and tensor scaling; these problems have many more applications and are a topic of much recent interest[GGOW19, BFG19].
Formally, the matrix scaling problem is defined for the -norm as follows. Given a matrix with at most non-zero entries, entrywise-positive target marginals with , and a parameter , find vectors such that the (rescaled) matrix satisfies
Here is the vector of row-marginals of the matrix and similarly is the vector of column-marginals. We refer to and as the scaling vectors, whereas and are called scaling factors. A common choice of target marginals is , i.e., every row and column sum target is , and we refer to these as the uniform target marginals. As is standard in the matrix scaling literature, we will henceforth assume that is asymptotically -scalable: for every , there exist such that satisfies Eq. 1.1. This depends only on the support of [RS89, Thm. 3], and is the case if and only if is in the convex hull of the points such that , where the are the standard basis vectors for . We will also always assume that the smallest non-zero entry of each of , and is at least .
Many classical algorithms for the matrix scaling problem can be viewed from the perspective of convex optimization. For example, one can solve the matrix scaling problem by minimizing the convex (potential) function
where denotes the standard inner product on . The popular and practical Sinkhorn algorithm [Sin64] – which alternates between rescaling the rows and columns to the desired marginals – can be viewed as a (block-)coordinate descent algorithm on , i.e., a first-order method. Given its simplicity, it is no wonder that it has been rediscovered in many settings, and is known by many names, such as the RAS algorithm, iterative proportional fitting, or raking.
It is known that the iterates in the Sinkhorn algorithm converge to a -scaled matrix whenever is asymptotically -scalable. The convergence rate of Sinkhorn’s algorithm is known in various settings, and we give a brief overview of the (classical) time complexity of finding an --scaling, noting that a single iteration can be implemented in time . When is entrywise positive then one can scale in time [vAGL21]; in the -setting for uniform target marginals a similar result can be found in [KK93, KLRS08]. In the general setting where has at most non-zero entries the complexity becomes (for arbitrary target marginals ); a proof may be found in [ANWR17] for the entrywise-positive case, [CK20] for exactly scalable matrices (i.e., where the problem can be solved for ) and [vAGL21] for asymptotically scalable matrices.
While simple, the Sinkhorn algorithm is by no means the fastest when the parameter is small. The classical state-of-the-art algorithms are based on second-order methods such as (traditional) interior point methods or so-called box-constrained Newton methods [CMTV17, AZLOW17], the latter of which we describe in more detail below. We note that these algorithms depend on fast algorithms for graph sparsification and Laplacian system solving, so are rather complicated compared to Sinkhorn’s algorithm. The box-constrained Newton methods can find --scaling vectors in time , where the hides polylogarithmic factors in and , and is a certain diameter bound (made precise later in the introduction). For entrywise-positive matrices, is of size , and in general it is known to be [AZLOW17, Lem. 3.3]. Alternatively, the interior-point method of [CMTV17] has a time complexity of , which is better than the box-constrained Newton method for general inputs, but worse for entrywise-positive matrices.
Recently, a quantum algorithm for matrix scaling was developed based on Sinkhorn’s algorithm [vAGL21], giving --scaling vectors in time for general matrices or for entrywise-positive matrices. This improves the dependence on and at the cost of a higher dependence on when compared to the classical Sinkhorn algorithm (which we recall runs in or for entrywise-positive matrices). Furthermore, it was shown that this quantum algorithm is optimal for (sufficiently small) constant : there exists an (independent of ) such that every quantum algorithm that --scales to uniform target marginals with probability at least must make at least queries. It was left as an open problem whether one can also obtain quantum speedups (in terms of or ) using second-order methods. In this work we give improved quantum lower and upper bounds on the complexity of matrix scaling. We first prove a lower bound: we show that every quantum algorithm that solves the matrix scaling problem for small enough must make a number of queries proportional to the number of non-zero entries in the matrix, even when the success probability of the algorithm is only assumed to be exponentially small. This shows that one cannot hope to get a quantum algorithm for matrix scaling with a polylogarithmic -dependence and sublinear dependence on . However, this does not rule out that second-order methods can be useful in the quantum setting. Indeed, we give a quantum box-constrained Newton method which has a better -dependence than the previously mentioned quantum Sinkhorn algorithm, and in certain settings is strictly better, such as for entrywise-positive instances.
1.1 Lower bounds
As previously mentioned, we show for entrywise-positive instances that a polynomial -dependence is necessary for a scaling algorithm whose -dependence is for a constant . More precisely, we prove the following theorem (which we extend to an -lower bound in the general setting of non-zero entries in Corollary 2.16):
There exists a constant such that every matrix scaling algorithm that, with probability , finds scaling vectors for entrywise-positive -matrices with -error must make at least queries to the matrix. This even holds for uniform targets and matrices with smallest entry .
The proof of this lower bound is based on a reduction from deciding whether bit strings have Hamming weight or . Specifically, given bit strings for , each with Hamming weight where , we show that any matrix scaling algorithm can be used to determine all the . One can show that every quantum algorithm that computes all the ’s needs to make quantum queries to the bit string , even if the algorithm has only exponentially small success probability: to determine a single with success probability at least , one needs to make quantum queries to the bit string [BBC01, NW99, Amb02], and one can use the strong direct product theorem of Lee and Roland [LR13] to prove the lower bound for computing all ’s simultaneously. To convert the problem of computing the to an instance of matrix scaling, one constructs a matrix whose first rows are (roughly) given by the vectors for some , and whose last rows are given by . For such an , the column sums are all , and the row sums are determined by the . If the matrix obtained by a single Sinkhorn step from (i.e., rescaling all the rows) were exactly column scaled, then the optimal scaling factors encode the . We show that, if one randomly (independently for each ) permutes the beforehand, this is approximately the case: the column sums of this will be close to the desired column sums with high probability, and hence the first step of Sinkhorn gives approximately optimal scaling factors (which encode the ). Then, we give a lower bound on the strong convexity parameter of the potential , to show that all sufficiently precise minimizers of also encode the . In other words, from sufficiently precise scaling factors, we can recover the , yielding the reduction to matrix scaling, and consequently a lower bound for the matrix scaling problem.
We additionally study the problem of computing an --approximation of the vector of row sums of an -normalized matrix . This is a common subroutine for matrix scaling algorithms; for instance, the gradient of the potential function from (1.2) that we optimize for the upper bound can be determined from the row and column sums by subtracting the desired row and column sums, so the complexity of this subroutine directly relates to the complexity of each iteration in our algorithm. We give the following lower bound for this problem.
Theorem 1.2 (Informal).
For and an -normalized matrix , computing an --approximation of takes queries to . Moreover, there exists a constant such that computing an --approximation of takes queries to .
The first lower bound in the theorem is proven in Theorem 2.17. Its proof is based on a reduction from independent instances of the majority problem, as for the lower bound for matrix scaling. The second lower bound can be derived from the lower bound for matrix scaling given in [vAGL21]: using a constant number of calls to a subroutine that provides constant-precision approximations to the row- and column-sum vectors, one can implement Sinkhorn’s algorithm to find a constant-precision -scaling, which for a small enough constant takes queries. Hence, there exists a constant (independent of ) such that computing an --approximation of takes at least queries to the matrix entries.
1.2 Upper bounds
While the first lower bound (Theorem 1.1) shows that a (quantum) algorithm for matrix scaling cannot have both an -dependence for and a polylogarithmic -dependence, one can still hope to obtain a second-order -time algorithm with a better -dependence than the quantum Sinkhorn algorithm of [vAGL21]. We show that one can build on a box-constrained Newton method [CMTV17, AZLOW17] to obtain a quantum algorithm which achieves this, at the cost of depending quadratically on a certain diameter bound ; recall for comparison that the classical box-constrained Newton methods run in time . For general matrices, one has the bound [AZLOW17, Lem. 3.3]. The performance of the resulting quantum box-constrained Newton method is summarized in the following theorem:
Theorem 1.3 (Informal version of Corollaries 3.16 and 3.15).
For asymptotically-scalable matrices with non-zero entries and target marginals , one can find such that is --scaled to in quantum time where is the -norm of at least one -minimizer of . When is entrywise positive we have , so the algorithm runs in quantum time .
We emphasize that the diameter bound does not need to be provided as an input to the algorithm. Note that for entrywise-positive matrices, the algorithm improves over the quantum Sinkhorn method, which runs in time .
Let us give a sketch of the box-constrained method that we use, see Section 3.1 for details. The algorithm aims to minimize the (highly structured) convex potential function from Eq. 1.2. A natural iterative method for minimizing convex functions is to minimize in each iteration the quadratic Taylor expansion of the function at the current iterate. A box-constrained method constrains the minimization of the quadratic Taylor expansion to those that lie in an -ball of radius around the current iterate (hence the name):
This is guaranteed to decrease a convex function whenever it is second-order robust, i.e., whenever the Hessian of at a point is a good multiplicative approximation of the Hessian at every other point in a constant-radius -ball. One can show that the steps taken decrease the potential gap by a multiplicative factor which depends on the distance to the minimizer.
One then observes that the function from Eq. 1.2 is second-order robust. Moreover, its Hessian has an exceptionally nice structure: given by
it is similar to a Laplacian matrix. This means that the key subroutine in this method (approximately) minimizes quadratic forms over -balls, where is a Laplacian matrix; without the -constraint, this amounts to solving the Laplacian system . Such a subroutine can be implemented for the more general class of symmetric diagonally-dominant matrices (with non-positive off-diagonal entries) on a classical computer in (almost) linear time in the number of non-zero entries of [CMTV17]. For technical reasons, one has to add a regularization term to , and the regularized potential instead has a symmetric diagonally-dominant Hessian structure. Given the recent quantum algorithm for graph sparsification and Laplacian system solving of Apers and de Wolf [AdW20], one would therefore hope to obtain a quantum speedup for the box-constrained Newton method. We show that one can indeed achieve this by first using the quantum algorithm for graph sparsification, and then using the classical method for the minimization procedure. We note, however, that in order to achieve a quantum speedup in terms of and , we incur a polynomial dependence in the time complexity on the precision with which we can approximate and (as opposed to only a polylogarithmic dependence classically). Such a speedup with respect to one parameter (dimension) at the cost of a slowdown with respect to another (precision) is more common in recent quantum algorithms for optimization problems and typically requires a more careful analysis of the impact of approximation errors. Interestingly, for the classical box-constrained Newton method, the minimization subroutine is the bottleneck, whereas in our quantum algorithm, the cost of a single iteration is dominated by the time it takes to approximate the vector . Using similar techniques as in [vAGL21], one can obtain an additive -approximation of in time roughly . To obtain an efficient quantum algorithm we therefore need to control throughout the run of the algorithm. We do so efficiently by testing in each iteration whether the -norm of is too large, if it is, we divide the matrix by (by shifting by an appropriate multiple of the all-ones vector), which reduces the potential.
1.3 Open problems
Our lower bound on matrix scaling shows that it is not possible to provide significant quantum speedups for scaling of entrywise-positive matrices in the high-precision scaling regime. However, the best classical upper bound for -scaling when no assumptions are made on the support of the matrices is , where is the number of non-zero entries [CMTV17] (recall that this hides a polylogarithmic dependence on ). The algorithm that achieves this bound is an interior-point method, rather than a box-constrained Newton method. It is an interesting open problem whether such an algorithm also admits a quantum speedup in terms of while retaining a polylogarithmic -dependence. Note that while the interior-point method relies on fast Laplacian system solvers, it is not enough to merely replace this by a quantum Laplacian system solver, as the dimension of the linear system in question is rather than . More generally, the possibility of obtaining quantum advantages in high-precision regimes for optimization problems is still a topic of ongoing investigation.
A second natural question is whether the lower bounds from Theorem 1.2 for computing an approximation of the row sums are tight. The best upper bound for the row-sum vector approximation that we are aware of is the one we use in the upper bound for scaling: we can compute an --approximation of the row- and column sums in time . For constant this matches the lower bound (up to log-factors), but for non-constant it remains an interesting open problem to close the gap between and .
2 Lower bounds for matrix scaling and marginal approximation
In this section we prove two lower bounds: an -lower bound for --scaling matrices with at most non-zero entries, and for an -lower bound for --approximation of the row-sum vector of a normalized matrix (with non-negative entries). The proofs for both lower bounds are based on a reduction from the lower bound given below in Theorem 2.1. In Section 2.1 we construct the associated instances for matrix scaling, and in Section 2.2 we analyze their column marginals after a single iteration of the Sinkhorn algorithm. Afterwards, in Section 2.3 we show that these column marginals are close enough to the target marginals for the reduction to matrix scaling to work, and in Section 2.4 we put the ingredients together, with the main theorem being Theorem 2.15. Finally, in Section 2.5 we prove the lower bound for computing approximations to the row marginals.
The lower bound we reduce from is the following:
Let be even, such that is an integer, and let be an integer. Given binary strings , where has Hamming weight for , computing with probability a string that agrees with in of the positions requires quantum queries.
Let and define the partial Boolean function as
It is known that computing with success probability at least takes quantum queries to [NW99, Cor. 1.2], i.e., the bounded-error quantum query complexity is .
We now proceed with bounding the query complexity of computing of the entries of defined by . We will make use of the general adversary bound [HLŠ07] which is known to satisfy [LMR11, Thm. 1.1]. The strong direct product theorem of Lee and Roland [LR13, Thm. 5.5] says that for every , and integers , every quantum algorithm that outputs a bit string , and makes quantum queries to the bit strings with
has the property that agrees with on at least a -fraction of the entries with probability at most .111In [LR13] the upper bound on is stated in terms of where is the Gram matrix of . For Boolean functions one has [LMR11, Thm. 3.4]. Here
is the Kullback–Leibler divergence between the distributionsand . For , and , one has . Therefore, the strong direct product theorem shows that computing of the entries of correctly, with success probability at least , takes quantum queries. ∎
We will use this lower bound with and . The following intuition is useful to keep in mind. For a fixed , define the matrix whose -th row equals and whose -th row equals . Then has the property that the row-marginals encode the Hamming weights of the , and are all very close to . (This implies that the first row-rescaling step of Sinkhorn’s algorithm encodes the .) Moreover, the column-marginals are exactly uniform. Hence, one may hope that all sufficiently precise scalings of to uniform targets have scaling factors that are close to those given by the first row-rescaling step of Sinkhorn’s algorithm (and hence learn most of the ).
Below we formalize this approach. We show that if one randomly permutes the coordinates of each (independently over ), then with high probability, all -scalings of the resulting matrix are close to the first step of Sinkhorn’s algorithm; here we need to choose sufficiently large () and sufficiently small (). The section is organized as follows. In Section 2.1 we formally define our matrix scaling instances and we analyse the first row-rescaling step of Sinkhorn’s algorithm. In Section 2.2 we show that after the row-rescaling step, with high probability (over the choice of permutations), the column-marginals are close to uniform. In Sections 2.4 and 2.3 we use the strong convexity of the potential from Eq. 1.2 to show that if the above event holds, then all approximate minimizers of can be used to solve the counting problem.
2.1 Definition of the scaling instances and analysis of row marginals
Let be even. Let and let have Hamming weight for . Sample uniformly random permutations and define by . Let be some number depending on , and consider the matrix whose entries are and . Then each column sum is , and the row sums of are given by
be the row scaling factors obtained from a single Sinkhorn step. We first observe that the difference between and permits to recover .
For the specific row-scaling factors for given in (2.1), for every it holds that
Observe that ( and therefore)
2.2 Concentration of column marginals
We first give an explicit expression for the th column marginal of where is given in (2.1).
The matrix has column sums
We now show that with high probability (over the choice of permutations) the column marginals are close to uniform. To do so, we first compute the expectation of (Corollary 2.5). This quantity allows us to obtain the desired concentration of the column marginals via Hoeffding’s inequality (Lemma 2.6).
Let and . Define random variables
. Define random variables, by
Then and .
Observe that each is with probability because is chosen uniformly randomly from , and is with probability . Therefore . By linearity of expectation, the result follows. ∎
For and , with probability at least , we have
Observe first that
For fixed and distinct , and are independently distributed random variables because and are independent. Therefore, is a sum of independent random variables, with each , and Hoeffding’s inequality yields for any that
Assuming that , we have
With this estimate, we see that
For any , with probability , we have
2.3 Strong convexity properties of the potential
For a -strongly convex function , the set has a diameter that is bounded by a function of (we make this well-known fact precise in Lemma 2.11). We show that our potential is strongly convex when viewed as a function from (a suitable subset of) the linear subspace to (note that is invariant under translation by multiples of ). We use this to show that whenever is small, is close to the minimizer of on . It is easy to verify that Corollary 2.7 in fact gives an upper bound on the norm of the gradient at (with as in (2.1)). This implies that is close to the minimizer of on , and by the triangle inequality, is also close to any other for which is small. In the rest of this section we make the above precise.
In Lemma 2.8 we show that the Hessian of restricted to
has smallest eigenvalue at leastwhere is the smallest entry appearing in . In Lemma 2.10 we show that . This implies that for all that are a constant distance away from in the -norm, in other words, is -strongly convex around its minimizer. Lemma 2.12 summarizes these lemmas: it gives a quantitative bound on the distance to a minimizer, in terms of the gradient.
Let be an entrywise non-negative matrix with and let be the potential for this matrix as given in (1.2), where is the orthogonal complement of . Then where is the projection onto and is the smallest entry appearing in . In particular, is strictly convex on .
The Hessian of the potential is given by
We give a lower bound on the non-zero eigenvalues of the Hessian as follows. Conjugating the Hessian with the matrix preserves the spectrum and yields the matrix
One can recognize this as a weighted Laplacian of a complete bipartite graph. We denote by the smallest entry of and we use for the all-ones matrix. Then
where the PSD inequality follows because the difference of the terms is the weighted Laplacian of the bipartite graph with weighted bipartite adjacency matrix , which has non-negative entries. Now observe that the last term is the (unweighted) Laplacian of the complete bipartite graph , whose spectrum is , , with multiplicities , and respectively. The zero eigenvalue corresponds to the all-ones vector of length and it is easy to see that indeed also lies in the kernel of . This shows that the non-zero eigenvalues of are at least
, and that it has a one-dimensional eigenspace corresponding to, spanned by the vector . Hence, . ∎
We now bound the smallest entry of the rescaled matrix. For this we use the following lemma (cf. [KLRS08, Lem. 6.2], [vAGL21, Cor. C.3 (arXiv)]) which bounds the variation norm of the scaling vectors of an exact scaling.
Let and let be such that is exactly -scaled. Then
Let be an entrywise-positive matrix with and let be the potential for this matrix as given in (1.2), where is the orthogonal complement of . Let be the unique minimizer of in . Then . Moreover, for any we have .
By Lemma 2.8 is strictly convex on . We also know that is exactly scalable. Hence has a unique minimizer . By Lemma 2.9 we know that the variation norm of and are bounded by . Hence, for every we have
Therefore, the ratio between entries of is bounded:
Since the sum of the entries of equals , this implies that the smallest entry of is at least . Finally, for any and any we have