The matrix exponential operator has applications in numerous domains, ranging from time integration of Ordinary Differential Equations or network analysis  to various simulation problems (like power grids  or nuclear reactions 
) or machine learning. In graph signal processing, it appears in the diffusion process of a graph signal – an analog on graphs of Gaussian low-pass filtering.
Given a graph and its combinatorial Laplacian matrix , let
be a signal on this graph (a vector containing a value at each node), thediffusion of in is defined by the equation with . It admits a closed-form solution involving the heat kernel , which features the matrix exponential.
Applying the exponential of a matrix to a vector can be achieved by computing the matrix to compute then the matrix-vector product . However, this becomes quickly computationally prohibitive in high dimension, as storing and computing , as well as the matrix-vector product , have cost at least quadratic in . Moreover, multiscale graph representations such as graph wavelets , graph-based machine learning methods , rely on graph diffusion at different scales, thus implying applications of the matrix exponential of various multiples of the graph Laplacian.
To speedup such repeated computations one can use a well-known technique based on approximations of the (scalar) exponential function using Chebyshev polynomials. We build on the fact that polynomial approximations  can significantly reduce the computational burden of approximating with good precision when where is sparse positive semi-definite (PSD); this is often the case when is the Laplacian of a graph when each node is connected to a limited number of neighbors. The principle is to approximate the exponential as a low-degree polynomial in , . Several methods exist, some requiring the explicit computation of coefficients associated with a particular choice of polynomial basis, others, including Krylov-based techniques, not requiring explicit evaluation of the coefficients but relying on an iterative determination  of the polynomial approximation on the subspace spanned by .
Our contribution is twofold. First, we devise a new bound on the approximation error of truncated Chebyshev expansions of the exponential, that improves upon existing works [10, 11, 12]. This avoids unnecessary computations by determining a small truncation order to achieve a prescribed error. Second, we propose to compute at different scales faster, by reusing the calculations of the action of Chebyshev polynomials on and combining them with adapted coefficients for each scale . This is particularly efficient for multiscale problems with arbitrary values of , unlike  which is limited to linear spacing.
The rest of this document is organized as follows. In Section II we describe univariate function approximation with Chebyshev polynomials, and detail the approximation of scaled univariate exponential functions with new bounds on the coefficients (Corollary II.2). This is used in Section III to approximate matrix exponentials with controlled complexity and controlled error (Lemma III.1), leading to our new error bounds (17), (18), (19). Section IV is dedicated to an experimental validation, with a comparison to the state-of-the-art bounds of , and an illustration on multiscale diffusion.
Ii Chebyshev approximation of the exponential
The Chebyshev polynomials of the first kind are characterized by the identity . They can be computed as , and using the following recurrence relation:
The Chebyshev series decomposition of a function is: , where the Chebyshev coefficients are:
Truncating this series yields an approximation of . For theoretical aspects of the approximation by Chebyshev polynomials (and other polynomial basis) we refer the reader to .
Ii-a Chebyshev series of the exponential
We focus on approximating the univariate transfer function , which will be useful to obtain low-degree polynomial approximations of the matrix exponential
for positive semi-definite matrices whose largest eigenvalue satisfies(see Section III).
Using a change of variable:
and the Chebyshev series of yields:
This leads to the following expression for :
where for any : .
Ii-B Chebyshev coefficients of the exponential operator
with the modified Bessel function of the first kind and the exponentially scaled modified Bessel function of the first kind.
The following lemma applied to yields another expression of the coefficients (3), which will be used to bound the error of the truncated Chebyshev expansion.
Lemma II.1 (, Equation 2.91).
Let be a function expressed as an infinite power series and assume that this series is uniformly convergent on . Then, we can express the Chebyshev coefficients of by:
Consider , . The coefficients of its Chebyshev expansion satisfy:
Moreover we have:
Denoting , we expand into a power series:
Using Lemma II.1, we obtain for each :
For any integers we have and hence
Iii Approximation of the matrix exponential
The extension of a univariate function to symmetric matrices exploits the eigen-decomposition , where , to define the action of as . When for some integer k, this yields , hence the definition matches with the intuition when is polynomial or analytic.
The exponential of a matrix could be computed by taking the exponential of the eigenvalues, but diagonalizing the matrix would be computationally prohibitive. However computing a matrix such as is rarely required, as one rather needs to compute its action on a given vector. This enables faster methods, notably using polynomial approximations: given a square symmetric matrix and a univariate function , a suitable univariate polynomial is used to approximate with . Such a polynomial can depend on both and . When the function admits a Taylor expansion, a natural choice for is a truncated version of the Taylor series . Other polynomial bases can be used, such as the Padé polynomials, or in our case, the Chebyshev polynomials [11, 16] (see  for a survey), leading to approximation errors that decay exponentially with the polynomial order .
Iii-a Chebyshev approximation of the matrix exponential
Consider any PSD matrix of largest eigenvalue (adaptations to matrices with arbitrary largest eigenvalue will be discussed in the experimental section). To approximate the action of , where , we use the matrix polynomial where is the polynomial obtained by truncating the series (4). The truncation order offers a compromise between computational speed and numerical accuracy. The recurrence relations (1) on Chebyshev polynomials yields recurrence relations to compute . Given a polynomial order , computing requires matrix-vector products for the polynomials, and Bessel function evaluations for the coefficients. This cost is dominated by the matrix-vector products, which can be very efficient if is a sparse matrix.
Iii-B Generic bounds on relative approximation errors
Denote the polynomial obtained by truncation at order of the Chebyshev expansion (4). For a given input vector , one can measure a relative error as:
Expressing and in an orthonormal eigenbasis of yields a worst-case relative error:
with the eigenvalues of and .
Consider , as in Section II-A, and a PSD matrix with largest eigenvalue . Consider as above where . With we have
Denote . For we have:
and hence for (10) yields:
Since on (recall that ), we obtain using Corollary II.2:
While (11) is the error of approximation of , relative to the input energy , an alternative is to measure this error w.r.t. the output energy :
Since we have . Using Lemma III.1 we obtain for and any :
Iii-C Specific bounds on relative approximation errors
As the bounds (17)-(18) are worst-case estimates, they may be improved for a specific input signal by taking into account its properties. To illustrate this, let us focus on graph diffusion where is a graph Laplacian, assuming that . Since is the inner product between and the unit constant vector
, which is an eigenvector of the graph Laplacianassociated to the zero eigenvalue , we have . For this leads to the bound:
This bound improves upon (18) if , i.e. when
Considering a graph with Laplacian , the diffusion of a graph signal at scale is obtained by computing . In general, the largest eigenvalue of is not necessarily (except for example if is a so-called normalized graph Laplacian, instead of a combinatorial graph Laplacian). To handle this case with the polynomial approximations studied in the previous section, we first observe that where and . Using Equation (20) with scale allows to select which of the two bounds (18) or (19) is the sharpest. The selected bound is then used to find a polynomial order that satisfies a given precision criterion. Then, we can use the recurrence relations (2) to compute the action of the polynomials on , and combine them with the coefficients given by (5).
Iv-a Bound tightness
Our new bounds accuracy can be illustrated by plotting the minimum truncated order required to achieve a given precision. The new bounds can be compared to the tightest bound we could find in the literature :
where , and:
with and . This bound can be made independent of by using the same procedure as that of used to establish (18):
An experiment was performed over 25 values of ranging from to , 100 samplings of Erdos-Reyni graphs of size , with connection probability (which yields ), and coupled with a random signal with entries drawn i.i.d. from a centered standard normal distribution. For each set of experiment parameters, for each bound, generically noted , the minimum order ensuring was computed, as well as the oracle minimum degree guaranteeing MSE . The median values over graph generations are plotted on Fig 1 against
, with errorbars using quartiles.
We observe that our new bounds (blue) follow more closely the true minimum (black) achieving the targeted precision, up to , thus saving computations over the one of  (red). Also of interest is the fact that the bounds (19)-(21) specific to the input signal are much tighter than their respective generic counterparts (18)-(23).
Iv-B Acceleration of multiscale diffusion
When diffusing at multiple scales , it is worth noting that computations can be factorized. The order can be computed only once (using the largest ), as well as . Eventually, the coefficients can be evaluated for all values to generate the needed linear combinations of , . In order to illustrate this speeding-up phenomenon, our method is compared to scipy.sparse.linalg.expm_multiply, from the standard SciPy Python package, which uses a Taylor approximation combined with a squaring-and-scaling method. See  for details.
For a first experiment, we take the Standford bunny , a graph built from a rabbit ceramic scanning (2503 nodes and 65.490 edges, with ). For the signal, we choose a Dirac located at a random node. We compute repeatedly the diffusion from 2 to 20 scales sampled in . Our method is set with a target error . When the values are linearly spaced, both methods can make use of their respective multiscale acceleration. In this context, our method is about twice faster than Scipy’s; indeed, it takes 0.36 s plus 6.1 s per scale, while Scipy’s takes 0.74 s plus 2.4 s per scale.
On the other hand, when the values are uniformly sampled at random, SciPy cannot make use of its multiscale acceleration. Indeed, its computation cost increases linearly with the number of ’s, with an average cost of 0.39 s per scale. Whereas, the additional cost for repeating our method for each new is negligible (0.0094 s on average) compared to the necessary time to initialize once and for all, the (0.30 s).
The trend observed here holds for larger graphs as well. We run a similar experiment on the ogbn-arxiv graph from the OGB datasets . We take uniformly sampled scales in (following recommendations of ), and set our method for . We observe an average computation time of 504 s per scale (i.e. 1 hr and 24 min for 10 scales) for Scipy’s method, and 87 s plus 50 s per scale for our method (i.e. around 9 min for 10 scales). If we impose a value , comparable to the floating point precision achieved by Scipy, the necessary polynomial order only increases by 6%, which does not jeopardise the computational gain of our method. This behavior gives insight into the advantage of using our fast approximation for addressing the multiscale diffusion on very large graphs.
All experiments are in Python using NumPy and SciPy. They ran on a Intel-Core i5-5300U CPU with 2.30GHz processor and 15.5 GiB RAM on a Linux Mint 20 Cinnamon.
Our contribution is twofold: first, using the now classical Chebyshev approximation of the exponential function, we significantly improved the state of the art theoretical bound used to determine the minimum polynomial order needed for an expected approximation error. Second, in the specific case of the heat diffusion kernel applied to a graph structure, we capitalized on the polynomial properties of the Chebyshev coefficients to factorize the calculus of the diffusion operator, reducing thus drastically its computational cost when applied for several values of the diffusion time.
The first contribution is particularly important when dealing with the exponential of extremely large matrices, not necessarily coding for a particular graph. As our new theoretical bound guarantees the same approximation precision for a polynomial order downsized by up to one order of magnitude, the computational gain is considerable when modeling the action of operators on large mesh grids, as it can be the case, for instance, in finite element calculus.
Our second input is directly related to our initial motivation in  that was to identify the best diffusion time in an optimal transport context. Thanks to our accelerated algorithm, we can afford to repeatedly compute the so-called Diffused Wasserstein distance to find the optimal domain adaptation between graphs’ measures.
The authors wish to thank Nicolas Brisebarre for discussions that helped sharpening some bounds, as well as Hakim Hadj-Djilani for discussions on python implementations.
-  R. M. Mattheij, S. W. Rienstra, and J. T. T. Boonkkamp, Partial differential equations: modeling, analysis, computation. SIAM, 2005.
-  O. De la Cruz Cabrera, M. Matar, and L. Reichel, “Analysis of directed networks via the matrix exponential,” Journal of Computational and Applied Mathematics, vol. 355, pp. 182–192, 2019.
-  H. Zhuang, S.-H. Weng, and C.-K. Cheng, “Power grid simulation using matrix exponential method with rational krylov subspaces,” in 2013 IEEE 10th International Conference on ASIC. IEEE, 2013, pp. 1–4.
-  M. Pusa and J. Leppänen, “Computing the matrix exponential in burnup calculations,” Nuclear science and engineering, vol. 164, no. 2, pp. 140–150, 2010.
-  A. Barbe, M. Sebban, P. Gonçalves, P. Borgnat, and R. Gribonval, “Graph diffusion wasserstein distances,” in European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2020.
-  F. R. Chung and F. C. Graham, Spectral graph theory. American Mathematical Soc., 1997, no. 92.
-  M. Mehra, A. Shukla, and G. Leugering, “An adaptive spectral graph wavelet method for pdes on networks,” Adv. Comput. Math., vol. 47, no. 1, p. 12, 2021. [Online]. Available: https://doi.org/10.1007/s10444-020-09824-9
-  M. Popolizio and V. Simoncini, “Acceleration techniques for approximating the matrix exponential operator,” SIAM J. Matrix Anal. Appl., vol. 30, no. 2, pp. 657–683, 2008. [Online]. Available: https://doi.org/10.1137/060672856
-  M. A. Botchev and L. A. Knizhnerman, “ART: adaptive residual-time restarting for krylov subspace matrix exponential evaluations,” J. Comput. Appl. Math., vol. 364, 2020. [Online]. Available: https://doi.org/10.1016/j.cam.2019.06.027
-  V. Druskin and L. Knizhnerman, “Two polynomial methods of calculating functions of symmetric matrices,” USSR Computational Mathematics and Mathematical Physics, vol. 29, no. 6, pp. 112–121, 1989. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0041555389800205
-  L. Bergamaschi and M. Vianello, “Efficient computation of the exponential operator for large, sparse, symmetric matrices,” Numer. Linear Algebra Appl., vol. 7, no. 1, pp. 27–45, 2000. [Online]. Available: https://doi.org/10.1002/(SICI)1099-1506(200001/02)7:127::AID-NLA1853.0.CO;2-4
-  J. C. Mason and D. C. Handscomb, Chebyshev polynomials. CRC press, 2002.
-  A. H. Al-Mohy and N. J. Higham, “Computing the action of the matrix exponential, with an application to exponential integrators,” SIAM J. Scientific Computing, vol. 33, no. 2, pp. 488–511, 2011.
-  G. Phillips, Interpolation and Approximation by Polynomials, ser. CMS Books in Mathematics. Springer, 2003. [Online]. Available: https://books.google.fr/books?id=87vciTxMcF8C
-  M. Abramowitz, I. A. Stegun, and R. H. Romer, “Handbook of mathematical functions with formulas, graphs, and mathematical tables,” 1988. [Online]. Available: http://www.math.ubc.ca/~cbm/aands/toc.htm
-  D. I. Shuman, P. Vandergheynst, and P. Frossard, “Chebyshev polynomial approximation for distributed signal processing,” in Distributed Computing in Sensor Systems, 7th IEEE International Conference and Workshops, DCOSS 2011, Barcelona, Spain, 27-29 June, 2011, Proceedings. IEEE Computer Society, 2011, pp. 1–8. [Online]. Available: https://doi.org/10.1109/DCOSS.2011.5982158
-  N. J. Higham and A. H. Al-Mohy, “Computing matrix functions,” Acta Numer., vol. 19, pp. 159–208, 2010. [Online]. Available: https://doi.org/10.1017/S0962492910000036
-  R. Riener and M. Harders, Virtual reality in medicine. Springer Science & Business Media, 2012.
-  W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec, “Open graph benchmark: Datasets for machine learning on graphs,” 2021.
-  C. Donnat, M. Zitnik, D. Hallac, and J. Leskovec, “Learning structural node embeddings via diffusion wavelets,” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Jul 2018. [Online]. Available: http://dx.doi.org/10.1145/3219819.3220025