Entropy is a fundamental quantity in many areas of science and engineering. The von Neumann entropy, named after John von Neumann, is the extension of classical entropy concepts to the field of quantum mechanics, and its foundations can be traced to von Neumann’s work on Mathematische Grundlagen der Quantenmechanik111Originally published in German in 1932; published in English under the title Mathematical Foundations of Quantum Mechanics in 1955.. In his work, Von Neumann introduced the notion of the density matrix, which facilitated the extension of the tools of classical statistical mechanics to the quantum domain in order to develop a theory of quantum measurements.
From a mathematical perspective (see Section 1.1 for details) the density matrix is a symmetric positive semidefinite matrix in with unit trace. Let , be the eigenvalues of in decreasing order; then, the entropy of is defined as222 is symmetric positive semidefinite and thus all its eigenvalues are non-negative. If is equal to zero we set to zero as well.
The above definition is a proper extension of both the Gibbs entropy and the Shannon entropy to the quantum case and implies an obvious algorithm to compute by first computing the eigendecomposition of ; known algorithms for this task necessitate time . Clearly, as grows, such running times are impractical. For example,  describes an entangled two-photon state generated by spontaneous parametric down-conversion, which can result in a density matrix with .
Motivated by the above discussion, we seek numerical algorithms that approximate the von Neumann entropy of large density matrices, e.g., symmetric positive definite matrices with unit trace, faster than the trivial approach. Our algorithms build upon recent developments in the field of Randomized Numerical Linear Algebra (RandNLA), an interdisciplinary research area that exploits randomization as a computational resource to develop improved algorithms for large-scale linear algebra problems. Indeed, our work here focuses at the intersection of RandNLA and information theory, delivering novel randomized linear algebra algorithms and related quality-of-approximation results for a fundamental information-theoretic metric.
We will focus on finite-dimensional function (state) spaces. In this setting, the density matrix represents the statistical mixture of pure states, and has the form
The vectorsfor represent the pure states and can be assumed to be pairwise orthogonal and normal while the
’s correspond to the probability of each state and satisfyand . From a linear algebraic perspective, eqn. (2) can be rewritten as
where is the matrix whose columns are the vectors and is a diagonal matrix whose entries are the (positive) ’s. Given our assumptions for , ; also is symmetric positive semidefinite with its eigenvalues equal to the and corresponding left/right singular vectors equal to the ’s; and . Notice that eqn. (3
) essentially reveals the (thin) Singular Value Decomposition (SVD) of . The Von Neumann entropy of , denoted by is equal to (see also eqn. (1))
The second equality follows from the definition of matrix functions . More precisely, we overload notation and consider the full SVD of , namely , where
is an orthogonal matrix whose topcolumns correspond to the pure states and the bottom columns are chosen so that . Here is a diagonal matrix whose bottom diagonal entries are set to zero. Let for any and let . Then, using the cyclical property of the trace and the definition of ,
1.2 Our contributions
We present and analyze three randomized algorithms to approximate the von Neumann entropy of density matrices. The first two algorithms (Sections 2 and 3) leverage two different polynomial approximations of the matrix function : the first approximation uses a Taylor series expansion while the second approximation uses Chebyschev polynomials. Both algorithms return, with high probability, relative-error approximations to the true entropy of the input density matrix, under certain assumptions. More specifically, in both cases, we need to assume that the input density matrix has non-zero eigenvalues, or, equivalently, that the probabilities , , corresponding to the underlying pure states are non-zero. The running time of both algorithms is proportional to the sparsity of the input density matrix and depends (see Theorems 1 and 3 for precise statements) on, roughly, the ratio of the largest to the smallest probability (recall that the smallest probability is assumed to be non-zero), as well as the desired accuracy.
The third algorithm (Section 4) is fundamentally different, if not orthogonal, to the previous two approaches. It leverages the power of random projections [6, 22] to approximate numerical linear algebra quantities, such as the eigenvalues of a matrix. Assuming that the density matrix has exactly non-zero eigenvalues, e.g., there are pure states with non-zero probabilities , , the proposed algorithm returns, with high probability, relative error approximations to all probabilities . This, in turn, implies an additive-relative error approximation to the entropy of the density matrix, which, under a mild assumption on the true entropy of the density matrix, becomes a relative error approximation (see Theorem 5 for a precise statement). The running time of the algorithm is again proportional to the sparsity of the density matrix and depends on the target accuracy, but, unlike the previous two algorithms, does not depend on any function of the .
From a technical perspective, the theoretical analysis of the first two algorithms proceeds by combining the power of polynomial approximations, either using Taylor series or Chebyschev polynomials, to matrix functions, combined with randomized trace estimators. A provably accurate variant of the power method is used to estimate the largest probability . If this estimate is significantly smaller than one, it can improve the running times of the proposed algorithms (see discussion after Theorem 1). The third algorithm leverages a powerful, multiplicative matrix perturbation result that first appeared in . Our work in Section 4 is a novel application of this inequality to derive bounds for RandNLA algorithms.
Finally, in Section 5, we present a detailed evaluation of our algorithms on synthetic density matrices of various sizes, most of which were generated using Matlab’s QETLAB toolbox . For some of the larger matrices that were used in our evaluations, the exact computation of the entropy takes hours, whereas our algorithms return approximations with relative errors well below in only a few minutes.
1.3 Prior work
The first non-trivial algorithm to approximate the von Neumann entropy of a density matrix appeared in . Their approach is essentially the same as our approach in Section 3. Indeed, our algorithm in Section 3 was inspired by their approach. However, our analysis is somewhat different, leveraging a provably accurate variant of the power method as well as provably accurate trace estimators to derive a relative error approximation to the entropy of a density matrix, under appropriate assumptions. A detailed, technical comparison between our results in Section 3 and the work of  is delegated to Section 3.3.
Independently and in parallel with our work, 
presented a multipoint interpolation algorithm (building upon) to compute a relative error approximation for the entropy of a real matrix with bounded condition number. The proposed running time of Theorem 35 of  does not depend on the condition number of the input matrix (i.e., the ratio of the largest to the smallest probability), which is a clear advantage in the case of ill-conditioned matrices. However, the dependency of the algorithm of Theorem 35 of  on terms like or (where represents the number of non-zero elements of the matrix ) could blow up the running time of the proposed algorithm for reasonably conditioned matrices.
We also note the recent work in , which used Taylor approximations to matrix functions to estimate the log determinant of symmetric positive definite matrices (see also Section 1.2 of  for an overview of prior work on approximating matrix functions via Taylor series). The work of  used a Chebyschev polynomial approximation to estimate the log determinant of a matrix and is reminiscent of our approach in Section 3 and, of course, the work of .
We conclude the section by noting that our algorithms will use two tools (described, for the sake of completeness, in the Appendix) that appeared in prior work. The first tool is the power method, with a provable analysis that first appeared in . The second tool is a provably accurate trace estimation algorithm for symmetric positive semidefinite matrices that appeared in .
2 An approach via Taylor series
Our first approach to approximate the von Neumann entropy of a density matrix uses a Taylor series expansion to approximate the logarithm of a matrix, combined with a relative-error trace estimator for symmetric positive semi-definite matrices and the power method to upper bound the largest singular value of a matrix.
2.1 Algorithm and Main Theorem
Our main result is an analysis of Algorithm 1 (see below) that guarantees relative error approximation to the entropy of the density matrix , under the assumption that has pure states with for all .
The following theorem is our main quality-of-approximation result for Algorithm 1.
A few remarks are necessary to better understand the above theorem. First, could be set to , the smallest of the probabilities corresponding to the pure states of the density matrix . Second, it should be obvious that in Algorithm 1 could be simply set to one and thus we could avoid calling Algorithm 6 to estimate by and thus compute . However, if is small, then could be significantly smaller than one, thus reducing the running time of Algorithm 1, which depends on the ratio . Third, ideally, if both and were used instead of and , respectively, the running time of the algorithm would scale with the ratio .
2.2 Proof of Theorem 1
Let be a symmetric positive definite matrix with unit trace and whose eigenvalues lie in the interval , for some . Then,
From the definition of the von Neumann entropy and a Taylor expansion,
Eqn. (2.2) follows since has unit trace and from a Taylor expansion: indeed, for a symmetric matrix whose eigenvalues are all in the interval . We note that the eigenvalues of are in the interval , whose upper bound is strictly less than one since, by our assumptions, . ∎
We now proceed to prove Theorem 1. We will condition our analysis on Algorithm 6 being successful, which happens with probability at least . In this case, is an upper bound for all probabilities . For notational convenience, set . We start by manipulating as follows:
We now bound the two terms and separately. We start with : the idea is to apply Lemma 10 on the matrix with . Hence, with probability at least :
A subtle point in applying Lemma 10 is that the matrix must be symmetric positive semidefinite. To prove this, let the SVD of be , where all three matrices are in and the diagonal entries of are in the interval . Then, it is easy to see that and , where the diagonal entries of are non-negative, since the largest entry in is upper bounded by . This proves that is symmetric positive semidefinite for any , a fact which will be useful throughout the proof. Now,
which shows that the matrix of interest is symmetric positive semidefinite. Additionally, since is symmetric positive semidefinite, its trace is non-negative, which proves the second inequality in eqn. (7) as well.
We proceed to bound as follows:
To prove eqn. (8), we used von Neumann’s trace inequality333Indeed, for any two matrices and , , where (respectively ) denotes the -th singular value of (respectively ). Since (its largest singular value), this implies that ; if is symmetric positive semidefinite, .. Eqn. (8) now follows since is symmetric positive semidefinite444This can be proven using an argument similar to the one used to prove eqn. (7).. To prove eqn. (9), we used the fact that for any . Finally, to prove eqn. (10), we used the fact that since the smallest entry in is at least by our assumptions. We also removed unnecessary absolute values since is non-negative for any positive integer .
Combining the bounds for and gives
We have already proven in Lemma 2 that
where the last inequality follows since . Collecting our results, we get
and using (), guarantees that and concludes the proof of the theorem. We note that the failure probability of the algorithm is at most (the sum of the failure probabilities of the power method and the trace estimation algorithm).
3 An approach via Chebyschev polynomials
Our second approach is to use a Chebyschev polynomial-based approximation scheme to estimate the entropy of a density matrix. Our approach follows the work of , but our analysis uses the trace estimators of  and Algorithm 6 and its analysis. Importantly, we present conditions under which the proposed approach is competitive with the approach of Section 2.
3.1 Algorithm and Main Theorem
The proposed algorithm leverages the fact that the von Neumann entropy of a density matrix is equal to the (negative) trace of the matrix function and approximates the function by a sum of Chebyschev polynomials; then, the trace of the resulting matrix is estimated using the trace estimator of .
Let with , , and for . Let and be the Chebyschev polynomials of the first kind for any integer . Algorithm 2 computes (an upper bound estimate for the largest probability of the density matrix ) and then computes and estimates its trace. We note that the computation can be done efficiently using Clenshaw’s algorithm; see Appendix C for the well-known approach.
Our main result is an analysis of Algorithm 2 that guarantees a relative error approximation to the entropy of the density matrix , under the assumption that has pure states with for all . The following theorem is our main quality-of-approximation result for Algorithm 2.
The similarities between Theorems 1 and 3 are obvious: same assumptions and directly comparable accuracy guarantees. The only difference is in the running times: the Taylor series approach has a milder dependency on , while the Chebyschev-based approximation has a milder dependency on the ratio , which controls the behavior of the probabilities . However, for small values of (),
Thus, the Chebyschev-based approximation has a milder dependency on but not necessarily when compared to the Taylor-series approach. We also note that the discussion following Theorem 1 is again applicable here.
3.2 Proof of Theorem 3
We will condition our analysis on Algorithm 6 being successful, which happens with probability at least . In this case, is an upper bound for all probabilities . We now recall (from Section 1.1) the definition of the function for any real , with . Let be the density matrix, where both and are matrices in . Notice that the diagonal entries of are the s and they satisfy for all .
Using the definitions of matrix functions from , we can now define , where is a diagonal matrix in with entries equal to for all . We now restate Proposition 3.1 from  in the context of our work, using our notation.
The function in the interval can be approximated by
where , , and for . For any ,
In the above, for any integer and . Notice that the function essentially maps the interval , which is the interval of interest for the function , to , which is the interval over which Chebyschev polynomials are commonly defined. The above theorem exploits the fact that the Chebyschev polynomials form an orthonormal basis for the space of functions over the interval .
We now move on to approximate the entropy using the function . First,
Recall from Section 1.1 that . We can now bound the difference between and . Indeed,
The last inequality follows by the final bound in Lemma 4, since all ’s are the in the interval .
Recall that we also assumed that all s are lower-bounded by and thus
We note that the upper bound on the s follows since the smallest is at least and thus the largest cannot exceed . We note that we cannot use the upper bound in the above formula, since could be equal to one; is always strictly less than one but it cannot be a priori computed (and thus cannot be used in Algorithm 2), since is not a priori known.
We can now restate the bound of eqn. (12) as follows:
where the last inequality follows by setting
Next, we argue that the matrix is symmetric positive semidefinite (under our assumptions) and thus one can apply Lemma 10 to estimate its trace. We note that
which trivially proves the symmetry of and also shows that its eigenvalues are equal to for all . We now bound
using our upper () and lower () bounds on the s. Now proves that are non-negative for all and thus is a symmetric positive semidefinite matrix; it follows that its trace is also non-negative.
We can now apply the trace estimator of Lemma 10 to get
For the above bound to hold, we need to set
We now conclude as follows:
The first inequality follows by adding and subtracting and using sub-additivity of the absolute value; the second inequality follows by eqns. (14) and (16); the third inequality follows again by eqn. (14); and the last inequality follows by using .
We note that the failure probability of the algorithm is at most (the sum of the failure probabilities of the power method and the trace estimation algorithm). Finally, we discuss the running time of Algorithm 2, which is equal to . Using the values for and from eqns. (15) and (17), the running time becomes (after accounting for the running time of Algorithm 6)
3.3 A comparison with the results of 
The work of  culminates to the error bounds described in Theorem 4.3 (and the ensuing discussion). In our parlance,  first derives the error bound of eqn. (12). It is worth emphasizing that the bound of eqn. (12) holds even if the s are not necessarily strictly positive, as assumed by Theorem 3: the bound holds even if some of the s are equal to zero.
Unfortunately, without imposing a lower bound assumption on the s it is difficult to get a meaningful error bound and an efficient algorithm. Indeed, the error implied by eqn. (12) (without any assumption on the s) necessitates setting to at least (perhaps up to a logarithmic factor, as we will discuss shortly). To understand this, note that the entropy of the density matrix ranges between zero and , where is the rank of the matrix , i.e., the number of non-zero ’s. Clearly, and thus is an upper bound for . Notice that if is smaller than , the error bound of eqn. (12) does not even guarantee that the resulting approximation will be positive, which is, of course, meaningless as an approximation to the entropy.
In order to guarantee a relative error bound of the form via eqn. (12), we need to set to be at least
which even for “large” values of (i.e., values close to the upper bound ) still implies that is . Even with such a large value for , we are still not done: we need an efficient trace estimation procedure for the matrix . While this matrix is always symmetric, it is not necessarily positive or negative semi-definite (unless additional assumptions are imposed on the s, like we did in Theorem 3). Unfortunately, we are not aware of any provably accurate, relative error approximation algorithms for the trace of just symmetric matrices: the results of [2, 18] only apply to symmetric positive (or negative) semidefinite matrices. The work of  does provide an analysis of a trace estimator for general symmetric matrices (pioneered by Hutchinson in ). However, in our notation, in order to achieve a relative error bound, the final error bound of  (see eqns. (19) and (20) in ), necessitates setting to the value of eqn. (18). However, in that case, (the number of random vectors to be generated in order to estimate the trace of a general symmetric matrix) grows as a function of555Up to logarithmic factors. (see eqn. (20) in ), which is prohibitively large, as with that many random vectors the running time of just the trace estimation algorithm blows to , which could easily exceed the trivial running time to exactly compute .
4 An approach via random projection matrices
Finally, we focus on perhaps the most interesting special case: the setting where at most (out of , with ) of the probabilities of the density matrix of eqn. (2) are non-zero. In this setting, we prove that elegant random-projection-based techniques achieve relative error approximations to all probabilities , . The running time of the proposed approach depends on the particular random projection that is used and can be made to depend on the sparsity of the input matrix.
4.1 Algorithm and Main Theorem
The proposed algorithm uses a random projection matrix to create a “sketch” of in order to approximate the s.
In words, Algorithm 3 creates a sketch of the input matrix by post-multiplying by a random projection matrix; this is a well-known approach from the RandNLA literature (see  for details). Assuming that has rank at most , which is equivalent to assuming that at most of the probabilities in eqn. (2) are non-zero (e.g., the system underlying the density matrix has at most pure states), then the rank of is also at most . In this setting, Algorithm 3 returns the non-zero singular values of as approximations to the , .
The following theorem is our main quality-of-approximation result for Algorithm 3.
Comparing the above result with Theorems 1 and 3, we note that the above theorem does not necessitate imposing any constraints on the probabilities , . Instead, it suffices to have non-zero probabilities. The final result is an additive-relative error approximation to the entropy of (as opposed to the relative error approximations of Theorems 1 and 3); under the mild assumption , the above bound becomes a true relative error approximation666Recall that ranges between zero and ..
4.2 Two constructions for the random projection matrix
We now discuss two constructions for the matrix and we cite two bounds regarding these constructions from prior work that will be useful in our analysis. The first construction is the subsampled Hadamard Transform, a simplification of the Fast Johnson-Lindenstrauss Transform of ; see [7, 20] for details. We do note that even though it appears that Algorithm 5 is always better than Algorithm 4 (at least in terms of their respective theoretical running times), both algorithms are worth evaluating experimentally: in particular, prior work  has reported that Algorithm 4 often outperforms Algorithm 5 is terms of empirical accuracy and running time when the input matrix is dense, as is often the case in our setting. Therefore, we choose to present results (theoretical and empirical) for both well-known constructions of (Algorithms 4 and 5).
Let such that and let be constructed by Algorithm 4. Then, with probability at least 0.9,
by setting .
Let such that and let be constructed by Algorithm 5. Then, with probability at least 0.9,