1 Introduction
This work is devoted to the design and comparison of tiled algorithms for the bidiagonalization of large matrices. Bidiagonalization is a widely used kernel that transforms a full matrix into bidiagonal form using orthogonal transformations. Its importance stems from the fact that, in many algorithms, the bidiagonal form is a critical step in computing the singular value decomposition (SVD) of a matrix. The need to compute the SVD arises in many areas of computational science and engineering. By the Eckart–Young theorem
[EckartYoung36], the singular vectors associated with the largest singular values provide the best way (in the 2-norm sense) to approximate the matrix. This approximation result leads to many applications, since it means that the SVD can be used to extract the “most important” information of a matrix, for instance to compress data or to make sense of data. In this era of Big Data, we are interested in very large matrices. To cite one application out of many, the SVD is needed for principal component analysis (PCA) in statistics, a widely used method in applied multivariate data analysis.
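For instance, the Eckart–Young property can be checked numerically in a few lines of NumPy (an illustrative sketch; the function name `best_rank_k` is ours):

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A (in the 2-norm), via the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((80, 60))
k = 10
Ak = best_rank_k(A, k)

# Eckart-Young: the 2-norm error of the best rank-k approximation of A
# equals the (k+1)-st singular value of A.
sigma = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - Ak, 2)
print(np.isclose(err, sigma[k]))   # True
```

The same truncation underlies PCA: keeping the leading singular triplets retains as much of the matrix as any rank-k object can.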
We consider algorithms for going from a “full” matrix to a condensed “band bidiagonal” form using orthogonal transformations. We use the framework of “algorithms by tiles”. Within this framework, we study: (i) the tiled bidiagonalization algorithm BiDiag, which is a tiled version of the standard scalar bidiagonalization algorithm; and (ii) the R-bidiagonalization algorithm RBiDiag, which is a tiled version of the algorithm that first performs the QR factorization of the initial matrix, and then the band bidiagonalization of the R-factor. For both bidiagonalization algorithms, BiDiag and RBiDiag, we use HQR-based reduction trees, where HQR stands for the Hierarchical QR factorization of a tiled matrix [j125]. Considering various reduction trees gives us the flexibility to adapt to the matrix shape and the machine architecture. In this work, we consider many types of reduction trees. In shared memory, they are named FlatTS, FlatTT, Greedy, and a newly introduced auto-adaptive tree, Auto. In distributed memory, they are somewhat more complex and take the topology of the machine into account. The main contributions are the following:

The design and comparison of the BiDiag and RBiDiag tiled algorithms with many types of reduction trees. There is considerable novelty in this. Previous work [4967575, Ltaief:2013:HBR:2450153.2450154, 6267821, Haidar:2013:IPS:2503210.2503292] on tiled bidiagonalization has only considered one type of tree (the FlatTS tree) and no RBiDiag. Previous work [Ltaief2012] has considered Greedy trees only for the QR steps of BiDiag (so half of the steps), it uses a FlatTS tree for the LQ steps (the other half of the steps), and it does not consider RBiDiag. This paper is the first to study RBiDiag for tiled bidiagonalization algorithms, and to study Greedy trees for both kinds of steps (QR and LQ) of the tiled bidiagonalization algorithm.

A detailed study of critical path lengths for FlatTS, FlatTT, Greedy with BiDiag and RBiDiag (so six different algorithms in total), which shows that:

The newly introduced Greedy-based schemes (BiDiag and RBiDiag) are much better than earlier proposed variants with unbounded resources and no communication: for square matrices of n × n tiles, their critical paths have a length Θ(n log n), instead of Θ(n²) for FlatTS and FlatTT.

On the one hand, BiDiagGreedy has a shorter critical path length than RBiDiagGreedy for square matrices; on the other hand, RBiDiagGreedy has a shorter critical path length than BiDiagGreedy for “tall and skinny” matrices. For example, for a p × q tile matrix with p and q going to infinity and p proportional to q, the two critical path lengths differ by a constant asymptotic ratio.


The implementation of our algorithms within the DPLASMA framework [dplasma6008998], which runs on top of the PaRSEC runtime system [dagueengineBosilca201237], and which enables parallel distributed experiments on multicore nodes. All previous tiled bidiagonalization studies [4967575, Ltaief:2013:HBR:2450153.2450154, 6267821, Haidar:2013:IPS:2503210.2503292, Ltaief2012] were limited to shared-memory implementations.

A practical auto-adaptive tree (Auto) that self-tunes for increased performance. This tree is especially useful in many practical situations. For example, it is appropriate (1) when the critical path length is not a major consideration, due to a limited number of resources; (2) when the intra-node “communication” is expensive, in which case TS kernels are much faster than TT kernels; or (3) both.

Experiments on a single multicore node, and on a few multicore nodes of a parallel distributed-memory platform, show the superiority of the new algorithms on a variety of matrix sizes, matrix shapes, and core counts. The new Auto algorithm outperforms its competitors in almost every test case, hence standing as the best algorithmic choice for most users.
The rest of the paper is organized as follows. Section 2 provides a detailed overview of related work. Section 3 describes the BiDiag and RBiDiag algorithms with the FlatTS, FlatTT and Greedy trees. Section 4 is devoted to the analysis of the critical paths of all variants. Section 5 outlines our implementation, and introduces the new Auto reduction tree. Experimental results are reported in Section 6. Finally, conclusion and hints for future work are given in Section 7.
2 Related Work
This section provides an overview of the various approaches to compute the
singular values of a matrix, and positions our new algorithm with respect to existing
numerical software kernels.
Computing the SVD.
The SVD of a matrix is a fundamental matrix
factorization and is a basic tool used in many applications. Computing
the SVD of large matrices in an efficient and scalable way is an important
problem that has gathered much attention. The matrices considered here are
rectangular, m by n. Without loss of generality, we consider m ≥ n.
We call GE2VAL the problem of computing (only) the singular values of
a matrix, and GESVD the problem of computing the singular values and the
associated singular vectors.
From full to bidiagonal form.
There are a few algorithms to compute the singular value decomposition. A large
class of these algorithms consists in first reducing the matrix to bidiagonal
form with orthogonal transformations (GE2BD step), then processing the bidiagonal
matrix to obtain the sought singular values (BD2VAL step).
These two steps (GE2BD and BD2VAL) are very different in nature. GE2BD can be
done in a known number of operations and has no numerical difficulties. On the
other hand, BD2VAL requires the convergence of an iterative process and is
prone to numerical difficulties. This paper mostly focuses on GE2BD: reduction
from full to bidiagonal form.
Clearly, GE2BD+BD2VAL solves GE2VAL: computing (only) the
singular values of a matrix. If the singular vectors are desired (GESVD), one
can also compute them by accumulating the “backward” transformations; in this
case, this consists in a VAL2BD step followed by a BD2GE step.
In 1965, Golub and Kahan [doi:10.1137/0702016] provided a singular value
solver based on an initial reduction to bidiagonal form. In [doi:10.1137/0702016, Th. 1], the GE2BD
step is done using a QR step on the first
column, then an LQ step on the first row, then a QR step on the second column,
etc. The steps are done one column (or row) at a time using Householder transformations.
This algorithm is implemented as a Level-2 BLAS algorithm in LAPACK as xGEBD2.
For an m by n matrix, the cost of this algorithm is approximately 4mn² − 4n³/3 flops.
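For illustration, this column-by-column Golub–Kahan reduction can be sketched in a few lines of NumPy (a textbook-style sketch for clarity, not the actual LAPACK xGEBD2 implementation; the helper names are ours):

```python
import numpy as np

def house(x):
    """Householder vector v and scalar beta with (I - beta*v*v^T) x = -+||x|| e1."""
    v = x.astype(float).copy()
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return v, 0.0
    v[0] += norm if x[0] >= 0 else -norm
    return v, 2.0 / (v @ v)

def bidiagonalize(A):
    """Reduce A (m >= n) to upper bidiagonal form, alternating a left
    (column-zeroing) and a right (row-zeroing) Householder reflection."""
    B = A.astype(float).copy()
    m, n = B.shape
    for j in range(n):
        v, beta = house(B[j:, j])                    # zero out B[j+1:, j]
        B[j:, j:] -= beta * np.outer(v, v @ B[j:, j:])
        if j < n - 2:
            v, beta = house(B[j, j+1:])              # zero out B[j, j+2:]
            B[:, j+1:] -= beta * np.outer(B[:, j+1:] @ v, v)
    return B

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
B = bidiagonalize(A)
# Orthogonal transformations preserve singular values:
print(np.allclose(np.linalg.svd(A, compute_uv=False),
                  np.linalg.svd(B, compute_uv=False)))   # True
```

Note how the left and right reflections alternate, exactly as in the QR-step/LQ-step alternation described above.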
Level 3 BLAS for GE2BD.
In 1989, Dongarra, Sorensen and Hammarling [DONGARRA1989215] explained how
to incorporate Level-3 BLAS into LAPACK xGEBD2. The idea is to compute a few
Householder transformations in advance, and then to accumulate and apply
them in blocks using the WY transform [doi:10.1137/0908009]. This
algorithm is available in LAPACK (using the compact WY
transform [doi:10.1137/0910005]) as xGEBRD. Großer and
Lang [GroBer:1999:EPR:330506.330509, Table 1] explain that this algorithm
performs approximately 50% of its flops in Level-2 BLAS (computing and accumulating Householder
vectors) and 50% in Level-3 BLAS (applying Householder vectors).
In 1995, Choi, Dongarra and Walker [Choi1995] presented the ScaLAPACK version, PxGEBRD, of the LAPACK xGEBRD algorithm of [DONGARRA1989215].
Multistep approach. Further improvements for GE2BD (detailed hereafter) are
possible. These improvements rely on combining multiple steps. These
multistep methods generally perform much better for GE2VAL
(when only singular
values are sought) than for GESVD (when singular values and singular vectors are sought).
When singular vectors are also sought, all the “multi” steps have to be
performed in “reverse” on the singular vectors, adding a non-negligible
overhead to the singular vector computation.
Preprocessing the bidiagonalization with a QR factorization (preQR step). In 1982, Chan [Chan:1982:IAC:355984.355990] explained that, for tall-and-skinny matrices, in order to perform fewer flops, one can preprocess the bidiagonalization step (GE2BD) with a QR factorization. In other words, Chan proposes to do preQR(m,n)+GE2BD(n,n) instead of GE2BD(m,n). A curiosity of this algorithm is that it introduces nonzeros where zeros were previously introduced; yet, there is a gain in terms of flops. Chan proved that the crossover point where preQR(m,n)+GE2BD(n,n) performs fewer flops than GE2BD(m,n) is when m is greater than 5n/3. Chan also proved that, asymptotically, preQR(m,n)+GE2BD(n,n) performs half as many flops as GE2BD(m,n), for a fixed n and m going to infinity. If the singular vectors are sought, preQR has more overhead: (1) the crossover point is moved to more tall-and-skinny matrices, and there is less gain; also (2) there is some complication as far as storage goes.
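Chan's flop argument can be summarized with the standard operation counts for Householder QR factorization and bidiagonalization of an $m \times n$ matrix (a reconstruction following the usual textbook counts, not quoted from [Chan:1982:IAC:355984.355990]):

```latex
\underbrace{4mn^2 - \tfrac{4}{3}n^3}_{\mathrm{GE2BD}(m,n)}
\quad\text{vs.}\quad
\underbrace{2mn^2 - \tfrac{2}{3}n^3}_{\mathrm{preQR}(m,n)}
+ \underbrace{\tfrac{8}{3}n^3}_{\mathrm{GE2BD}(n,n)}
= 2mn^2 + 2n^3 .
```

Equating the two counts gives the crossover $m = \tfrac{5}{3}n$, and letting $m \to \infty$ with $n$ fixed gives the asymptotic ratio $\tfrac{2mn^2}{4mn^2} = \tfrac{1}{2}$.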
The preQR step was somewhat known by the community, and an earlier reference is
for example the 1974 book of Lawson and Hanson
[doi:10.1137/1.9781611971217]. For this reason,
preprocessing the bidiagonalization by a QR factorization (preQR) is referred
to as the LHC (Lawson, Hanson and Chan) algorithm in [Trefethen:1997:NLA].
Three-step approach: partialGE2BD+preQR+GE2BD. In 1997, Trefethen and
Bau [Trefethen:1997:NLA] presented a “three-step” approach. The idea is
to first start a bidiagonalization; then, whenever the aspect ratio of the remaining
matrix passes a given mark, switch to a QR factorization; and then finish up with GE2BD.
Two-step approach: GE2BND+BND2BD. In 1999, Großer and Lang [GroBer:1999:EPR:330506.330509] studied a two-step approach for GE2BD: (1) go from full to band (GE2BND), (2) then go from band to bidiagonal (BND2BD). In this scenario, GE2BND contains most of the flops and is performed using Level-3 BLAS kernels; BND2BD does not use Level-3 BLAS, but it executes far fewer flops and operates on a smaller data footprint that may fit better in cache. There is a tradeoff in the choice of the bandwidth. If the bandwidth is too small, then the first step (GE2BND) has the same issues as GE2BD. If the bandwidth is too large, then the second step (BND2BD) has many flops and dominates the run time.
As mentioned earlier, when the singular vectors are sought (GESVD), the
overhead of the method in terms of flops is quite large. In 2013, Haidar,
Kurzak, and Luszczek [Haidar:2013:IPS:2503210.2503292] reported that,
despite the extra flops, a two-step approach leads to speedups even when
singular vectors are sought (GESVD).
Tiled Algorithms for the SVD. In the context of massive parallelism,
and of reducing data movement, many dense linear algebra algorithms have moved
to so-called “tiled algorithms”. In the tiled algorithm framework, algorithms
operate on tiles of the matrix, and tasks are scheduled by a
runtime system. In the context of the SVD, tiled algorithms naturally lead to a band
bidiagonal form. In 2010, Ltaief, Kurzak and Dongarra [4967575] presented
a tiled algorithm for GE2BND (to go from full to band bidiagonal form).
In 2013 (technical report in 2011), Ltaief, Luszczek and
Dongarra [Ltaief:2013:HBR:2450153.2450154] added the second step (BND2BD)
and presented a tiled algorithm for GE2VAL using GE2BND+BND2BD+BD2VAL.
In 2012, Ltaief, Luszczek, and Dongarra [Ltaief2012] improved the algorithm
for tall and skinny matrices by using
“any” tree instead of flat trees in the QR steps.
In 2012, Haidar,
Ltaief, Luszczek and Dongarra [6267821] improved the BND2BD step
of [Ltaief:2013:HBR:2450153.2450154]. Finally, in 2013, Haidar,
Kurzak, and Luszczek [Haidar:2013:IPS:2503210.2503292] considered the
problem of computing singular vectors (GESVD) by performing
GE2BND+BND2BD+BD2VAL+VAL2BD+BD2BND+BND2GE.
They showed that the two-step approach
(from full to band, then band to bidiagonal) can be successfully used not only for
computing singular values, but also for computing singular
vectors.
BND2BD step. The algorithm in LAPACK for BND2BD is xGBBRD.
In 1996, Lang [LANG19961] improved the sequential version of the algorithm and developed a parallel distributed algorithm.
Recently, PLASMA released an efficient multithreaded implementation [6267821, Ltaief:2013:HBR:2450153.2450154].
We also note that Rajamanickam [piroband] recently worked on this step.
BD2VAL step. Much research has been done, and is still being done, on this kernel,
and much software exists.
In LAPACK, to
compute the singular values and optionally the singular vectors of
a bidiagonal matrix,
the routine xBDSQR uses the Golub–Kahan QR algorithm [doi:10.1137/0702016];
the routine xBDSDC uses the divide-and-conquer algorithm [doi:10.1137/S0895479892242232]; and
the routine xBDSVX uses the bisection and inverse iteration algorithm.
Recent research has tried to apply the MRRR (Multiple Relatively Robust Representations) method [Willems2006] to this problem.
BND2BD+BD2VAL steps in this paper.
This paper focuses neither on BND2BD nor on BD2VAL. As far as we
are concerned, we can use any of the methods mentioned above: the faster these
two steps are, the better for us. In the experimental section of this study,
for BND2BD, we use the PLASMA multithreaded
implementation [6267821, Ltaief:2013:HBR:2450153.2450154] and, for BD2VAL, we use LAPACK
xBDSQR.
Intel MKL. We note that, since 2014 and Intel MKL 11.2, the Intel MKL library has been much faster at computing GESVD [intel]. Much of the literature on tiled bidiagonalization algorithms predates 2014, and so it was comparing against a much slower version of MKL than is now available. The speedups reported in this manuscript for the tiled algorithms over MKL are therefore much less impressive than previously reported ones.
We also note that, while there was a great performance improvement for
computing GESVD [intel] from Intel MKL 11.1 to Intel MKL 11.2, there was
no performance improvement for GEBRD. The reason is that the interface of
GEBRD prevents a two-step approach, while the interface of GESVD allows for
one. Similarly, our two-step algorithm does not have the same
interface as LAPACK GEBRD. Consequently, it would not be fair to compare our
algorithm with LAPACK GEBRD or Intel MKL GEBRD, because our algorithm does not
have a native LAPACK GEBRD interface. (That is, it does not produce the
Householder vectors column by column as requested by the LAPACK interface.) For
this reason, we always compare with GESVD, and not GEBRD.
Approaches to compute the SVD without going through bidiagonal form.
We note that going through bidiagonal form is not a necessary step to compute the
singular value decomposition of a matrix. There are approaches to compute the
SVD of a matrix which avoid the reduction to bidiagonal form. In the “direct
method” category, we can quote Jacobi
SVD [doi:10.1137/050639193, doi:10.1137/05063920X] and
QDWH-SVD [doi:10.1137/120876605]; one can also think of computing the singular
value decomposition directly from the band bidiagonal form, although not much
work has been done in this direction so far. We can also refer to all
“iterative methods” for computing the singular values; the goal of these
iterative methods is in general to compute a few singular triplets.
As far as software packages go, we can
quote
SVDPACK [SVDPACK],
PROPACK [PROPACK], and
PRIMME SVD [PRIMME_SVDS]. As for methods, we can quote [doi:10.1137/140979381, doi:10.1137/S1064827500372973, doi:10.1137/S0895479802404192].
Connection with the Symmetric Eigenvalue Problem.
Many ideas (but not all) can be applied (and have been applied) to symmetric tridiagonalization. This paper only focuses on bidiagonalization algorithms, but we believe that some of the tricks we show (though not all) could be used to create an efficient symmetric tridiagonalization.
3 Tiled Bidiagonalization Algorithms
In this section, we provide background on tiled algorithms for QR factorization, and then explain how to use them for the BiDiag and RBiDiag algorithms.
3.1 Tiled algorithms
Tiled algorithms are expressed in terms of tile operations rather than elementary operations. Each tile is of size b × b, where b is a parameter tuned to squeeze the most out of arithmetic units and memory hierarchy; typically, b is of the order of a few hundred on state-of-the-art machines [sc09agullo].
Consider a rectangular tiled matrix A of size p × q tiles. The actual size of A is thus m × n, with m = pb and n = qb, where b is the tile size. We number rows from 1 to p, columns from 1 to q, and factorization steps from k = 1 to k = q. Algorithm 1 outlines a tiled QR algorithm, where loop indices represent tiles:
In Algorithm 1, k is the step (and also the panel index), and elim(i, piv(i, k), k) is an orthogonal transformation that combines rows i and piv(i, k) to zero out the tile in position (i, k). We explain below how such orthogonal transformations can be implemented.
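Since Algorithm 1 is only sketched in the text, its loop structure can be illustrated as follows (a sketch under our own conventions: 0-based tile indices, and a `piv` argument standing for the pivot choice of the reduction tree, with the flat choice piv(i, k) = k as default):

```python
def tiled_qr_schedule(p, q, piv=lambda i, k: k):
    """List the eliminations elim(i, piv(i, k), k) of a p x q tiled QR
    factorization: at step k, every tile (i, k) below the diagonal is
    zeroed out by combining row i with the pivot row piv(i, k)."""
    ops = []
    for k in range(min(p, q)):        # factorization step = panel index
        for i in range(k + 1, p):     # zero out tile in position (i, k)
            ops.append((i, piv(i, k), k))
    return ops

# A 4 x 3 tiled matrix: six tiles below the diagonal, hence six eliminations.
print(tiled_qr_schedule(4, 3))
# [(1, 0, 0), (2, 0, 0), (3, 0, 0), (2, 1, 1), (3, 1, 1), (3, 2, 2)]
```

Different reduction trees correspond to different `piv` functions and different orderings of the eliminations within a step; the kernels implementing each elimination are discussed next.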
3.2 Kernels
To implement a given orthogonal transformation elim(i, piv(i, k), k), one can use six different kernels, whose costs are given in Table 1. In this table, the unit of time is the time needed to perform b³/3 floating-point operations.
Operation                             Panel           Update

                                      Name    Cost    Name    Cost
Factor square into triangle           GEQRT   4       UNMQR   6
Zero square with triangle on top      TSQRT   6       TSMQR   12
Zero triangle with triangle on top    TTQRT   2       TTMQR   6
There are two main possibilities to implement an orthogonal transformation elim(i, piv(i, k), k). The first version eliminates tile (i, k) with the TS (Triangle on top of square) kernels, as shown in Algorithm 2.
Here the tile panel (piv(i, k), k) is factored into a triangle (with GEQRT). The transformation is applied to the subsequent tiles (piv(i, k), j), j > k, in row piv(i, k) (with UNMQR). Tile (i, k) is zeroed out (with TSQRT), and the subsequent tiles (i, j), j > k, in row i are updated (with TSMQR). Flop counts are expressed in the same time unit as in Table 1. Dependencies are the following: TSQRT depends on GEQRT, and each TSMQR update depends on both TSQRT and the corresponding UNMQR.
UNMQR and TSQRT can be executed in parallel, as well as operations on different columns j. With an unbounded number of processors, the parallel time is thus 4 + 6 + 12 = 22 time-units.
The second approach to implement the orthogonal transformation elim(i, piv(i, k), k) is with the TT (Triangle on top of triangle) kernels, as shown in Algorithm 3. Here both tiles (piv(i, k), k) and (i, k) are factored into a triangle (with GEQRT). The corresponding transformations are applied to the subsequent tiles (piv(i, k), j) and (i, j), j > k, in both rows piv(i, k) and i (with UNMQR). Tile (i, k) is zeroed out (with TTQRT), and the subsequent tiles (i, j), j > k, in row i are updated (with TTMQR). The flop count is the same as before.
Now the factor operations in rows i and piv(i, k) can be executed in parallel. Moreover, the updates can be run in parallel with the factorization. Thus, with an unbounded number of processors, the parallel time is max(4 + 6, 4 + 2) + 6 = 16 time-units.
3.3 QR factorization
Consider a rectangular tiled matrix A of size p × q, with p ≥ q. There are many algorithms to compute the QR factorization of A, and we refer to [c177] for a survey. We use the three following variants:

FlatTS. This simple algorithm is the reference algorithm used in [Buttari2008, tileplasma]. At step k, the pivot row is always row k, and we perform the eliminations elim(i, k, k) in sequence, for i = k + 1, i = k + 2, down to i = p. In this algorithm, TS kernels are used.

FlatTT. This algorithm is the counterpart of the FlatTS algorithm with TT kernels. It uses exactly the same elimination operations, but with different kernels.

Greedy. This algorithm is asymptotically optimal, and turns out to be the most efficient on a variety of platforms [henc, j125]. An optimal algorithm is Grasap [henc, j125]; the difference between Grasap and Greedy is a small constant term, and we work with Greedy in this paper. The main idea is to eliminate many tiles in parallel at each step, using a reduction tree (see [c177] for a detailed description).
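The benefit of a reduction tree over a flat elimination order can be seen by counting elimination rounds for a single panel of p tiles (a simplified model that ignores the kernel weights of Table 1; the function names are ours):

```python
from math import ceil, log2

def flat_rounds(p):
    """Flat trees eliminate the p - 1 subdiagonal tiles of a panel in sequence."""
    return max(p - 1, 0)

def binomial_rounds(p):
    """A binomial tree pairs up the surviving tiles at every round,
    so a panel of p tiles is reduced in ceil(log2(p)) rounds."""
    rounds = 0
    while p > 1:
        p = (p + 1) // 2   # one TT elimination per pair of tiles
        rounds += 1
    return rounds

for p in (4, 16, 64):
    print(p, flat_rounds(p), binomial_rounds(p))
# 4 3 2
# 16 15 4
# 64 63 6
```

The logarithmic versus linear round count is precisely what drives the critical path results of Section 4.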
3.4 Bidiagonalization
Consider a rectangular tiled matrix A of size p × q, with p ≥ q. The bidiagonalization algorithm BiDiag proceeds as the QR factorization, but interleaves one step of LQ factorization between two steps of QR factorization (see Figure 1). More precisely, BiDiag executes the sequence QR(1); LQ(1); QR(2); LQ(2); …,
where QR(k) is the kth step of the QR algorithm (see Algorithm 1), and LQ(k) is the kth step of the LQ algorithm. The latter is a right factorization step that executes the column-oriented eliminations shown in Algorithm 4.
In Algorithm 4, elim(j, piv(j, k), k) is an orthogonal transformation that combines columns j and piv(j, k) to zero out the tile in position (k, j). It is the exact counterpart of the row-oriented eliminations, and it can be implemented with the very same kernels, either TS or TT.
3.5 R-Bidiagonalization
When m is much larger than n, R-bidiagonalization should be preferred, if minimizing the operation count is the objective. This RBiDiag algorithm does a QR factorization of A, followed by a bidiagonalization of the upper square R-factor. In other words, given a rectangular tiled matrix A of size p × q, with p ≥ q, RBiDiag executes a QR factorization of A, followed by the BiDiag sequence on the top q × q tiles of the R-factor.
Let m = pb and n = qb be the actual size of A (element-wise). The number of arithmetic operations is 4mn² − 4n³/3 for BiDiag and 2mn² + 2n³ for RBiDiag [Golub, p.284]. These numbers show that RBiDiag is less costly than BiDiag whenever 2mn² + 2n³ ≤ 4mn² − 4n³/3, or equivalently, whenever m ≥ 5n/3. One major contribution of this paper is to provide a comparison of BiDiag and RBiDiag in terms of parallel execution time, instead of operation count.
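Using the textbook operation counts 4mn² − 4n³/3 for bidiagonalization and 2mn² + 2n³ for R-bidiagonalization (our reading of [Golub, p.284]), the crossover can be checked numerically (the helper names are ours):

```python
def bidiag_flops(m, n):
    """Householder bidiagonalization of an m x n matrix (m >= n)."""
    return 4 * m * n**2 - 4 * n**3 / 3

def rbidiag_flops(m, n):
    """QR factorization (2mn^2 - 2n^3/3) followed by bidiagonalization
    of the n x n R-factor (8n^3/3)."""
    return 2 * m * n**2 - 2 * n**3 / 3 + 8 * n**3 / 3

n = 300
for m in (400, 500, 1000):          # crossover at m = 5n/3 = 500
    print(m, rbidiag_flops(m, n) < bidiag_flops(m, n))
# 400 False
# 500 False
# 1000 True
```

At the crossover m = 5n/3 the two counts coincide; beyond it, the QR preprocessing pays for itself in flops, although (as shown later) not necessarily in parallel execution time.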
3.6 Comments
Case m < n.
The paper focuses on the case m ≥ n, but everything holds in the m < n case: one simply transposes the initial matrix and switches LQ steps for QR steps and vice versa.
Real/Complex arithmetic.
Our work applies whether the matrices are real or complex. Indeed, our codes have been generated to allow for both real and complex arithmetic, in single and double precision.
Memory usage of our algorithm.
In the standard LAPACK case, GEBRD takes a matrix as input and overwrites it with the Householder reflectors and the bidiagonal matrix. The algorithm is therefore done in place, and no extra storage is needed. In the RBiDiag case, it is important to note that, as for the LHC algorithm [Chan:1982:IAC:355984.355990], extra storage is needed if one wants to compute the singular vectors. (If one wants the singular values only, all that matters is the bidiagonal form, and any Householder vectors generated can safely be discarded.)
Any tree is possible.
We note that, whether we choose to perform BiDiag or RBiDiag, any tree is possible at each step of the algorithm. A restriction of our study hereafter is to use the same tree for the QR step and the BiDiag step of RBiDiag. Clearly there is no such need, and one could consider for RBiDiag (for example) a combination like (QRFlatTS + BiDiagGreedy). We do not study such combinations; this also simplifies notations. Since we consider the same tree for the QR step and the BiDiag step of RBiDiag, when we write RBiDiagFlatTT (for example), we mean (QRFlatTT + BiDiagFlatTT).
BiDiagGreedy and BiDiagBinomial are the same.
Another name for BiDiagGreedy would be BiDiagBinomial: in the BiDiag algorithm, we repeat the same tree (in the Greedy case, a binomial tree) over and over again, so BiDiagGreedy and BiDiagBinomial are the exact same algorithm. (Whereas QRGreedy is not the same as QRBinomial.)
4 Critical paths
In this section, we compute exact or estimated values of the critical paths of the
BiDiag and RBiDiag algorithms with the FlatTS, FlatTT, and Greedy trees.
4.1 Bidiagonalization
Given a rectangular tiled matrix A of size p × q, with p ≥ q, the bidiagonalization algorithm BiDiag executes the sequence QR(1); LQ(1); QR(2); LQ(2); …
To compute the critical path, we first observe that there is no overlap between two consecutive steps QR(k) and LQ(k). To see why, consider w.l.o.g. the first two steps QR(1) and LQ(1) on Figure 1. A tile of the first row is used at the end of the QR(1) step to update the last row of the trailing matrix (whichever it is). In passing, note that all columns in this last row are updated in parallel, because we assume unlimited resources when computing critical paths. But this tile is the first tile modified by the LQ(1) step, hence there is no possible overlap. Similarly, there is no overlap between two consecutive steps LQ(k) and QR(k + 1). Consider steps LQ(1) and QR(2) on Figure 1: a tile is used at the end of the LQ(1) step to update the last column of the trailing matrix (whichever it is), and it is the first tile modified by the QR(2) step.
As a consequence, the critical path length of BiDiag is the sum of the critical path lengths of each QR and LQ step, and so an optimal BiDiag algorithm is one that is optimal at each QR and LQ step independently. We know that an optimal QR step is obtained by a binomial tree, and likewise for an optimal LQ step. Therefore the scheme alternating binomial QR and LQ trees, that is, BiDiagGreedy, is optimal. This is stated in the following theorem.
Theorem 1.
BiDiagGreedy is optimal, in the sense that it has the shortest critical path length over all BiDiag algorithms, i.e., tiled algorithms that alternate QR steps and LQ steps, with any QR trees and any LQ trees, to obtain a band bidiagonal matrix.
We now compute critical path lengths. From [henc, c177, j125], we have the following values for the critical path of one QR step applied to a tiled matrix of size p × q:
The critical path of one LQ step applied to a tiled matrix of size p × q is obtained symmetrically. Finally, in the BiDiag algorithm, the trailing matrix processed at each step loses one tile row after each QR step and one tile column after each LQ step. Altogether, we derive the following values:
We devote the whole next subsection to the study of the critical path of BiDiagGreedy, because this is more complicated.
4.2 Study of the critical path of BiDiagGreedy
4.2.1 A formula with a summation
By following the reasoning of the previous subsection, we get

(1)
The latter formula is exact and is easy enough to compute with a small computer code. However, it does not provide much insight. To get a better grasp on the critical path length, we expand Equation (1) further.
4.2.2 An exact formula when and are powers of two
First we note that, if $n$ is a power of 2:

(2)  $\sum_{i=1}^{n} \lceil \log_2 i \rceil = n \log_2 n - n + 1$

One way to understand the previous relation is to remember that “the” antiderivative of $\ln x$ is $x \ln x - x$; the previous relation reminds us of this result. The general case ($n$ is not a power of 2) of Equation (2) is

(3)  $\sum_{i=1}^{n} \lceil \log_2 i \rceil = n \lceil \log_2 n \rceil - 2^{\lceil \log_2 n \rceil} + 1$
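Equations (2) and (3) are easy to check numerically. Assuming Equation (2) is the identity sum_{i=1..n} ceil(log2(i)) = n log2(n) − n + 1 for n a power of 2, and Equation (3) its general-n analogue n⌈log2 n⌉ − 2^⌈log2 n⌉ + 1, a brute-force verification reads:

```python
from math import ceil, log2

def S(n):
    """Exact sum of ceil(log2(i)) for i = 1..n, as it appears in Equation (1)."""
    return sum(ceil(log2(i)) for i in range(1, n + 1))

# Equation (2): closed form when n is a power of two.
for k in range(1, 11):
    n = 2 ** k
    assert S(n) == n * k - n + 1

# Equation (3): general case.
for n in range(1, 500):
    c = ceil(log2(n))
    assert S(n) == n * c - 2 ** c + 1

print("identities verified")
```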
From Equations (1) and (2), we derive the critical path length when p is a power of two.
From Equations (1), (2) and (3), we derive the critical path length when both p and q are powers of two, with p ≥ q:
Theorem 2.
Let p and q be powers of 2, with p ≥ q. Then:
4.2.3 Using Stirling’s formula
We can also consider Equation (1) again and obtain simpler bounds by rounding the ceiling function in the logarithms down and up, and then using asymptotic analysis. We consider the bounds:
Applied to Equation (1), this leads to
We note that the difference between the lower bound (left side) and the upper bound (right side) is small.
We now consider “large” and , and we use Stirling’s formula as follows:
And with one more term we get:
In base 2,
We obtain that
We stop the expansion a little earlier.
We now study the term: , this gives:
We now study the term: . We can set with . We get
We have that
We are therefore drawn to study . We see that this is a decreasing function of . We are interested in the limit value when goes to . For large values of , we have
So that
So all in all, we have that
And so
Thus
So we get
(4) 
We note that the difference between the lower bound (left side) and the upper bound (right side) is small.
We now obtain the following theorem.
Theorem 3.
This theorem matches the power-of-2 case. We also understand that we perform QR steps that are each done with a binomial tree; this explains the first term. We perform LQ steps that are each done with a binomial tree; this explains the second term. The coefficient 6 in front is due to the weights used for the kernels. This theorem holds for any p and q.
In the case, we get that
In the case, we get that