1 Introduction
Many applications require the trace calculation of the matrix whose explicit form is not given but only its matrix multiplication to an arbitrary vector is available. An example is the trace calculation of the inverse of Dirac matrix in lattice quantum chromodynamics (QCD) [1]. The Dirac matrix is large and sparse so that explicit matrix inversion is not feasible, but the matrix multiplication of the inverse matrix to a vector is available through iterative linear equation solvers, such as the Conjugate Gradient (CG) algorithm [2]. In general, an exact solution can be obtained by matrixvector multiplications for a matrix. However, such an exact estimation is computationally expensive for large matrices, so stochastic approaches, such as the Hutchison estimator[3], are widely used. For specific structures of matrices, such as the banded matrices, improved trace estimators can be defined by exploiting the matrix structure, so that precise trace estimates are obtained by small number of matrixvector multiplications [4, 5, 6]. In general, however, the matrix structure is too complicated to be exploited by human knowledge, while it can be explored by using machine learning.
Machine learning is a field of science that makes a computer to act based on the learning from data. In contrast to the specific models with small number of free parameters in conventional data analysis, machine learning uses general models with large number of free parameters, such as the artificial neural network (ANN)
[7, 8]or the support vector machines (SVMs)
[9]. Based on the generality of the learning model and large number of free parameters, computer builds its own model from data.In this paper, the idea of machine learning is applied to the trace estimation. We build a trace estimator with high degrees of freedom, and train the estimator by using large number of matrices that have similar structures. As other machine learning applications, the learning procedure is computationally expensive, but the trace estimation using the trained estimator requires only small number of matrixvector multiplications. In section
2, we define the new trace estimator and learning procedure, and discuss the quality of the estimates and bias correction. Numerical experiments of the new trace estimator are shown in Section 3, and we conclude in Section 4.2 Theory
2.1 Backgrounds
Let us consider a matrix where the matrix itself is difficult to obtain, but its product to a vector is easy to evaluate. The exact trace of the matrix can be calculated by
(1) 
where is the vector whose th component is and other components are . This method requires times of matrixvector multiplications, so it is computationally very expensive for large matrices.
The trace of the matrix can be estimated by using the stochastic estimator
(2) 
with the random noise vectors satisfying
(3) 
The uncertainty of the stochastic trace estimator depends on the structure of the matrix and type of the random noise , because some type of random noise cancel the contribution from the elements of the matrix that cannot contribute to the trace but only to noise for the matrices that have specific structure [10]. One popular choice of the random noise vector is to filling the vector elements following Rademacher distribution^{1}^{1}1 or
with probability
.. Corresponding trace estimator is called the Hutchinson estimator [3].When structure of the matrix is known, one could use a small number of sophisticatedly chosen probing vectors, instead of the large number of random noise vectors, for the precise trace estimation [11, 4, 5]. It is called the probing method. It also can be applied to a trace estimation for the inverse of a sparse matrix by approximating the structure of using the sparsity pattern of the powers of the matrix, [5, 6]. When structure of the matrix is unknown or too complex to be exploited, however, one cannot construct such probing vectors by hand. If there is a large set of matrices that have similar structures, one may consider to find the probing vectors by using those data set instead of constructing the probing vectors by using human knowledge.
2.2 Construction of Probing Vectors using Machine Learning
Our goal is to find a set of probing vectors for that probes the trace of a matrix as
(4) 
for a given number of probing vectors , by using the machine learning. The machine learning requires input data to train the probing vectors: a set of matrices that have similar structures, and the traces of those matrices. Let us consider a set of sample matrices from a distribution. The cost function, which shows the difference between the exact answer and the probing vector estimate, is defined by
(5) 
The main idea is to find the probing vectors that minimize the cost function. In this paper, we consider the online machine learning, which updates the probing vectors using only one input training data (matrix) at each update step, as we find that the online learning is more efficient than the batch learning, which uses more than one input training data in an update step, in this application.
In the practical calculation, the trace of the matrices are estimated by using the Hutchison estimator in Eq. (2) as we are considering matrices whose exact traces are difficult to calculate. In the numerical test given in Section 3, we find that a sloppy estimator of with relatively small number of random noise vectors can construct the probing vectors that yield precise estimates. In addition to the sloppy estimator, reusing an input matrix and its trace estimate multiple times in the training procedure can further reduce the trace estimation cost. Details are given in Section 3.
Let us define a set of probing vectors at a given training time as . In the training process, they are updated as follows
(6) 
for with initial condition . In the numerical experiments, we fill the initial probing vectors
by using randomly chosen numbers following normal distribution of
, but one could use elaborately chosen initial guesses. Here is the gradient of the cost function with respect to the elements of the probing vectors, which can be calculated analytically, and is the parameter called the learning rate. With properly chosen learning rate, the update moves the probing vectors towards the steepest descent direction reducing the cost function. In addition to the steepest descent method, we include the momentum term to improve the speed of convergence [12]. In our numerical experiments, we use depending on the other parameters.Determination of the learning rate is important to make sure that the probing vectors efficiently converge to the global minimum that yields the most precise trace estimate. If it is too small, it converges slowly, and if it is too large, it does not converge to the minimum but oscillate. In early stage of the training, we determine the by using a onedimensional minimizer in order to maximize the training efficiency. After 100 updates, we calculate by the median of the values of the first 100 updates, and scale the learning rate as
(7) 
where is the parameter that determines the speed of scaling. When , the training actively updates the probing vectors to capture the structure of each input matrix, and when , the training finetunes the probing vectors to describe the general structure of input matrices.
2.3 Quality of Estimates and Bias Correction
Once the training is over, the probing vectors can be used for the estimation of trace of a matrix as Eq. (4). Naturally, the estimate may not be exact, and it could yield somewhat different value from the exact trace of the matrix. Considering a set of test matrices , one can define the deviation of the estimate from the exact trace by
(8) 
The quality of estimates can be assessed based on two factors: (1) nonzero mean value of
, which gives the bias of estimate and (2) fluctuation (standard deviation) of
over different matrices , which is related to the accuracy of the estimate.The bias can be corrected easily by redefining the trace estimator as
(9) 
where is an average of over a fixed number of sample matrices. In general, one needs to determine the more precise than the fluctuation of so that we can ignore the bias in the error estimation. Since the statistical uncertainty of an average over samples is of the standard deviation, averaging over 100 samples of will give a precise enough estimation.
Practically, the exact value of is very difficult to obtain because it requires the exact value of . Instead of the exact calculation, one can use the stochastic estimates of those. Due to the uncertainty of the stochastic estimation, one would need large number of samples of to determine precisely. However, the computation cost can be negligible if one reuses the stochastic trace estimates used in the training procedure.
After the correction, uncertainty of the trace estimator Eq. (9) depends only on the fluctuation of . If is distributed normally, one can report the standard deviation of from the set of test matrices as an uncertainty of the trace estimate. Otherwise, one can devise a proper error estimation based on the distribution of .
If one calculates an expectation value of a function of traces for a set of matrices, an unbiased estimator can be defined by applying the technique used in Refs. [13, 14] as
(10) 
The first term on the right hand side is the average calculated by using the probing vector trace estimator defined in Eq. (9). The second term on the right hand side is the bias correction term that takes care of all the misestimation of the trace using the probing vectors. Here we assume that are independent and identically distributed, so use only the first matrices for the correction term, but they could be chosen randomly or maximally separated in the sequence of . The in the equation can be any type of unbiased trace estimator, such as the exact estimator in Eq. (1) or the stochastic estimator in Eq. (2).
The role of the second term is following. First, let us take expectation value to the right hand side of Eq. (10). The terms using the biased estimator cancel, and the only remaining is the term using unbiased estimator, . Hence the estimator defined in Eq. (10
) is unbiased. Second, let us take variance to the right hand side of Eq. (
10) in order to see the statistical uncertainty of the estimate. In addition to the variance of the first term, the variance of the second term and the correlation between the first and second term comes in. In this way, the correction term converts the systematic uncertainty of the trace estimator using probing vectors to the statistical uncertainty, keeping the final estimates unbiased. If one uses the unbiased estimator defined in Eq. (10), one does not need to include the bias correction term in Eq. (9) because the bias will be corrected automatically in the new definition of the estimator. However, including the correction might reduce the statistical uncertainty in final estimation in some cases.The size of statistical uncertainty increased by the correction term is depending on two factors: (1) the correlation between and , and (2) the number of samples used in the correction term, . If the trace estimate using probing vectors is good enough so that the function values from those two trace calculations and are very close, the contribution of the correction term to the overall statistical uncertainty would be small, and one could use small . If the trace estimate using probing vectors is not precise and the two function values are quite different, one would need to use large to suppress the increase of statistical error due to the correction term. In general, we use the unbiased estimator Eq. (10) when we can take .
3 Numerical Experiments
In this section, we demonstrate the performance of the trace estimator using probing vectors described in the previous section. For the numerical test, we train probing vectors to estimate the trace of inverse of the random matrix generated by
(11) 
where
are independent and identically distributed random variables following normal distribution of
. Note that this matrix is an discrete symmetric derivative with identity: , except the Gaussian noise. It has the similar form with the Dirac operator in Lattice QCD [15, 16]. This is a sparse matrix whose inverse is expensive to calculate, but the matrixvector product of the inverse matrix can be calculated by using iterative methods. In this numerical test, we use BiCGSTAB algorithm[17] to evaluate the product of the inverse matrix and an arbitrary vector.L  100  1000  Note 

8  16  Probing vectors  
800  1600  Random noise vectors  
0.8  0.8  Momentum term  
Learning rate scheduling  
Pool size of input matrices  
Trainings 
We test two different sizes of matrices: and . Note that the training requires input matrices that have similar structures and the traces of each matrix. Although it is possible to use the exact trace calculations on those sizes of matrices we are testing in this section, we use estimates from the Hutchison estimator Eq. (2) for the traces of the input matrices so that the results are scalable to larger size matrices, whose exact traces are very expensive to calculate. The traces of inverse matrices are 70.92 0.07 and 709.20 0.22 for and , respectively. The numbers after “” notation are the standard deviation of the traces of random matrices. Unless explicitly noted, the discussion in this section is using the training parameters given in Table 1.
In the simulation, we find that the distribution of the estimation error , defined in Eq. (8), is following the normal distribution, as shown in Fig. 2. Hence we use the standard deviation of as the uncertainty of the estimates. Fig. 2 shows the learning efficiency for different number of noise vectors used in the Hutchison trace estimator. It shows that the learning efficiency is saturating when the number of noise vectors is larger than 800 (1600) for (1000). One interesting point is that the final uncertainty of the trace estimator is much smaller than the accuracy of the Hutchison stochastic estimator we used in the training. By comparing the uncertainty of the Hutchison stochastic trace estimator given in Fig. 3, one finds that the error of the trace estimator using probing vectors are similar to that of Hutchison stochastic estimator with about 20000 noise vectors.
Although we use relatively small number of noise vectors for the trace estimation of the input matrices for training, the most expensive part of the computation is the trace estimation. Hence we generate a finite size of pool of input matrices, and randomly pick up input matrices for the training from the pool. In the training procedure, the same matrix can be picked up multiple times. Reusing the same trace estimation calculated at the first pickup significantly reduces the training cost. Fig. 4 shows the learning efficiency for different size of the input matrix pool. It shows that the pool size of input matrices is large enough to give reasonable training results.
Another important parameter in the training is the that determines the scheduling of the learning rate, Eq. (7). If the is too large, the learning rate becomes tiny in early stage of the training, and it takes very long training period to converge. If the is too small, the trace estimator using probing vectors tends to have large bias for finite training period. Fig. 5 shows the learning efficiency and bias of the estimator for different values of . As expected, the training is not efficient when is too small or too large, but the bias is smaller for the trainings with smaller values.
As discussed in the previous section, the bias can be easily corrected by using the formulae given in Eq. (9) or Eq. (10), so the bias is not a concern in choosing a proper value of . However, the smaller value of needs larger number of noise vectors for the trace estimation of training input matrices, as shown in Fig. 6. The figure shows the training efficiency for different number of noise vectors when . By comparing it with Fig. 2, one finds that the training efficiency of training is worse than that of training when the number of noise vectors is small. This is because the learning rate is relatively large when is small. When the learning rate is large, training results are shifting for trainings of each input matrix. If the traces of input matrices are poorly determined, the training results are contaminated and fluctuating by the errors of the trace estimations of the input matrices. When learning rate is small, however, training results depend on series of input matrices, not only on the last one input matrix, so the errors of trace estimation of the input matrices are averaged out.
Fig. 7 shows the training efficiency for different number of probing vectors . As expected, the more probing vectors give the smaller estimate uncertainty. Except the smallest number of probing vector cases, the estimation errors after the training are about for , and for , depending on the number of probing vectors. By comparing the results with the efficiency of the Hutchison stochastic estimator, Fig. 3, one finds that the accuracy of the probing vector trace estimation with about 30 probing vectors is similar to that of the Hutchison estimator with random noise vectors.
4 Conclusion
We proposed a new trace estimator of the matrix whose explicit form is difficult to obtain but its matrix multiplication to a vector is easy to evaluate. The form of the estimator is similar to the Hutchison stochastic trace estimator, but instead of the random noise vectors in Hutchison estimator, we use small number of probing vectors determined by the machine learning technique. Through the training procedure, the probing vectors become probes for the trace of a matrix with given structure. Similar to other machine learning applications, the training procedure is computationally expensive, as it requires unbiased trace estimations for the input matrices, and multiple matrixvector multiplications. Once training is over, however, it becomes an efficient trace estimator that requires only small number of matrixvector multiplications. In the numerical experiments, it is shown that the accuracy of the trace estimator with probing vectors is similar to that of the Hutchison estimator with random noise vectors. Comparing at the same number of vectors, the estimator with probing vectors has more than 20 times smaller uncertainty than the Hutchison estimator with random noise vectors. The error of the probing vector estimator can be further reduced if one extends the training period with retuned parameters.
The quality of estimates are evaluated by using the deviation of the estimate from the exact trace, , defined in Eq. (8). In the numerical experiments we performed, it turns out that is following the normal distribution. Hence we use the standard deviation of as an uncertainty of the trace estimates. The trace estimates of the probing vector estimators can be biased, but a bias correction method using is introduced in Eq. (9). Another unbiased estimator, which converts the systematic error of the trace estimator to a statistical uncertainty, is presented in Eq. (10).
In this paper, we explored a general trace estimator using the machine learning idea. One might be able to define a better trace estimator by exploiting the symmetry conditions in a given system. Furthermore, the idea of exploring a matrix structure by using machine learning would be applicable to many other matrix problems.
Acknowledgements
We thank Garrett Kenyon for discussions on machine learning algorithms. This research used resources of the Institutional Computing at Los Alamos National Laboratory, and the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DEAC0500OR22725. This work is supported by the U.S. Department of Energy, Office of Science, Office of High Energy Physics under Contract No. DEKA1401020, and the LANL LDRD program.
References

[1]
Tanmoy Bhattacharya, Vincenzo Cirigliano, Rajan Gupta, HueyWen Lin, and Boram
Yoon.
Neutron Electric Dipole Moment and Tensor Charges from Lattice QCD.
Phys. Rev. Lett., 115(21):212002, 2015.  [2] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of research of the National Bureau of Standards, 49:409–436, 1952.
 [3] M. Hutchinson. A stochastic estimator for the trace of the influence matrix for Laplacian smoothing splines. Commun. Statist.Simula., 18:1059–1076, 1989.
 [4] C. Bekas, E. Kokiopoulou, and Y. Saad. An estimator for the diagonal of a matrix. Appl. Numer. Math., 57(1112):1214–1229, 2007.
 [5] Jok M. Tang and Yousef Saad. A probing method for computing the diagonal of a matrix inverse. Numerical Linear Algebra with Applications, 19(3):485–501, 2012.
 [6] Andreas Stathopoulos, Jesse Laeuchli, and Kostas Orginos. Hierarchical probing for estimating the trace of the matrix inverse on toroidal lattices. 2013.
 [7] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.
 [8] P. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, Cambridge, MA, 1974.

[9]
Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik.
A training algorithm for optimal margin classifiers.
InProceedings of the Fifth Annual Workshop on Computational Learning Theory
, COLT ’92, pages 144–152, New York, NY, USA, 1992. ACM.  [10] Tanmoy Bhattacharya, Rajan Gupta, and Boram Yoon. Disconnected Quark Loop Contributions to Nucleon Structure. PoS, LATTICE2014:141, 2014.
 [11] Thomas F. Coleman and Jorge J. Moré. Estimation of sparse Jacobian matrices and graph coloring problems. SIAM Journal on Numerical Analysis, 20(1):187–209, 1983.
 [12] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Neurocomputing: Foundations of research. chapter Learning Representations by Backpropagating Errors, pages 696–699. MIT Press, Cambridge, MA, USA, 1988.
 [13] Gunnar S. Bali, Sara Collins, and Andreas Schafer. Effective noise reduction techniques for disconnected loops in Lattice QCD. Comput. Phys. Commun., 181:1570–1583, 2010.
 [14] Thomas Blum, Taku Izubuchi, and Eigo Shintani. New class of variancereduction techniques using lattice symmetries. Phys. Rev., D88(9):094503, 2013.
 [15] T. DeGrand and C. DeTar. Lattice Methods for Quantum Chromodynamics. World Scientific, 2006.
 [16] C. Gattringer and C. Lang. Quantum Chromodynamics on the Lattice: An Introductory Presentation. Lecture Notes in Physics. Springer Berlin Heidelberg, 2009.
 [17] H. A. van der Vorst. BICGSTAB: A fast and smoothly converging variant of bicg for the solution of nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 13(2):631–644, March 1992.
Comments
There are no comments yet.