1 Introduction
Our algorithm, like many multiplication algorithms before it, relies on the DFT; the DFT is the major bottleneck in multiplication algorithms due to its running time.
The DFT is a very useful algorithm, and with the popularization of the FFT by Cooley and Tukey in 1965 [4], it became much more widely used. Since then, there have been a few attempts at speeding up the DFT. In 2012 there was a major breakthrough with Hassanieh, Indyk, Katabi, and Price's sparse FFT (sFFT) [6], which gave a time of $O(k \log n)$ for exactly $k$-sparse signals and $O(k \log n \log(n/k))$ for the general case. One drawback of the algorithm is that the transform size needs to be a power of 2.
In this paper, we present a DFT algorithm that reduces the number of operations required. Like some forms of the DFT, it only works for certain sizes, although it is much less limited than many. We then use this transform to multiply two natural numbers.
We note that it was Karatsuba who, in 1962, first improved on the naïve bound of $O(n^2)$, achieving $O(n^{\log_2 3})$.
Besides the straight-line program asymptotic limit [10], Schönhage and Strassen conjectured in 1971 that integer multiplication must take time $\Omega(n \log n)$ [12]. In that same paper, they set the world record at $O(n \log n \log \log n)$. That bound stood for almost 40 years, until Fürer obtained the new bound $n \log n \cdot K^{\log^* n}$, where $K$ is some constant and $\log^*$ is the iterated logarithm [5]. Finally, there were a number of results, particularly a number of impressive papers from Harvey and van der Hoeven, that culminated in their $O(n \log n)$-time algorithm [7].
2 Polynomial Transform Overview
We use the multitape Turing machine as our model of computation.
The basic idea behind the transform is to convert what is essentially one large DFT into many much smaller DFTs.
We start with numbers in the field $\mathbb{F}_p$, which is basically to say that we will be working with integers modulo $p$. The next step is to extend this field, which means that we will use the roots of an irreducible polynomial similarly to how the imaginary number $i$ is used in complex numbers. The key idea is that we will use a polynomial of degree 3 (or possibly of higher degree), so that the system will have at least 3 coefficients or components modulo $p$.
To use this extension, we need a polynomial that is irreducible modulo $p$. Then, the trick is to find other primes $q_i$, for varying $i$, modulo which the polynomial is completely reducible; in other words, it factors into three linear factors. The idea is then that these roots become "numbers," so that our 3-component or 3-coefficient system becomes one number modulo the various $q_i$. We'll call this component reduction.
We can demonstrate this component reduction with complex numbers; take the modulus 5 as an illustration. Modulo 5, we can take the imaginary number $i$ and square it to get $i^2 = -1 \equiv 4$. But $2^2 = 4$, and $3^2 = 9 \equiv 4 \pmod 5$. So any complex number $a + bi$, modulo 5, reduces to $a + 2b$ or $a + 3b$. The idea is then to use both versions, and use linear algebra to find the components modulo 5.
An example is $3 + 4i$. First, setting $i = 2$ yields $3 + 8 \equiv 1 \pmod 5$. Then, setting $i = 3$ yields $3 + 12 \equiv 0 \pmod 5$. Now we use linear algebra (modulo 5) to recover the real and imaginary components:
(2.1) $\begin{pmatrix} 1 & 2 \\ 1 & 3 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} \equiv \begin{pmatrix} 1 \\ 0 \end{pmatrix} \pmod 5$
Solving this system recovers the real coefficient $a = 3$ and the imaginary coefficient $b = 4$.
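The recovery step can be sketched in a few lines of code; the modulus 5 and the sample number $3 + 4i$ are illustrative choices, not part of the algorithm itself.

```python
# Hypothetical illustration: recover the components of 3 + 4i from its two
# integer images modulo 5, where i can be replaced by 2 or by 3 (both square
# to -1, i.e. 4, mod 5).
p = 5
a, b = 3, 4                                  # the complex number a + bi
roots = [2, 3]                               # the two "values" of i modulo 5
images = [(a + b * r) % p for r in roots]    # [(3+8)%5, (3+12)%5] = [1, 0]

# Solve the 2x2 system  a + r*b = image (mod 5)  by inverting the matrix
# [[1, 2], [1, 3]]; its determinant is 3 - 2 = 1, so inversion is trivial.
det = (roots[1] - roots[0]) % p
det_inv = pow(det, -1, p)
b_rec = ((images[1] - images[0]) * det_inv) % p
a_rec = (images[0] - roots[0] * b_rec) % p
print(a_rec, b_rec)                          # 3 4
```

The same inversion works for any pair of distinct root values, as long as their difference is invertible modulo the prime.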
The idea is essentially the same for the polynomial transform. We use 3 irreducible roots modulo our modulus $p$, which allows a system of 3 components or coefficients, using a polynomial of degree at least 3. Then we find primes $q_i$, for varying $i$, modulo which the system of vectors, components, coefficients, or whatever you want to call them, reduces to a single value. Thus, the system will be much smaller modulo $q_i$, since everything reduces to one number: the system, which has $3n$ elements modulo $p$, has $n$ elements modulo $q_i$. Now, since we find a few different values $q_i$ to use, we can use the Chinese remainder theorem on the coefficients modulo the $q_i$ to recover them modulo $p$, if we are very careful. The idea is that a 3-component system has a maximum value, from one multiplication modulo $p$, of $3(p-1)^2$ for each coefficient. This comes from the fact that each coefficient is at most $p-1$, so that when we multiply 2 coefficients together, we get at most $(p-1)^2$ as the value, and since there are 3 different coefficients contributing, we multiply this by 3. We are not done: to convert these 3 components into the reduced single value modulo $q_i$, we must multiply each again by at most $p-1$ and then sum the coefficients, to get a total value of size at most $9(p-1)^3$. That gives a single coefficient. Now, to be safe, we want our system to handle all of the coefficients summed into one value, so this means that we have $n$ coefficients of size at most $9(p-1)^3$ summed together into one coefficient modulo $q_i$, which means that our values are at most $9n(p-1)^3$.
Now, we can use the Chinese remainder theorem to ensure that we can recover our coefficients if the primes, when combined together, can handle an exact value of at most $9n(p-1)^3$. In other words, we want $\prod_i q_i > 9n(p-1)^3$.
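As a sanity check on the per-coefficient bound above, a small script can confirm that one product of two 3-coefficient systems never produces a coefficient larger than $3(p-1)^2$; the cyclic wraparound and the prime $p = 7$ are illustrative assumptions.

```python
# Toy check (assumed parameters): one cyclic product of two 3-coefficient
# systems with entries at most p-1 yields coefficients of size at most
# 3*(p-1)**2, matching the per-coefficient bound in the text.
def cyclic_mul3(u, v):
    w = [0, 0, 0]
    for i in range(3):
        for j in range(3):
            w[(i + j) % 3] += u[i] * v[j]   # no modular reduction: exact sizes
    return w

p = 7
worst = cyclic_mul3([p - 1] * 3, [p - 1] * 3)
print(worst, 3 * (p - 1) ** 2)   # [108, 108, 108] 108
```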
We note that the order of the roots that we use does not matter when we combine our matrices: the values of the roots are interchanged, but they still give the same result. However, we know which root is which, so we can technically order them.
2.1 An Example
We can demonstrate the idea by showing how we'd perform a polynomial transform of a given size. We first note that our modulus is a prime, and also that the chosen cubic polynomial is irreducible modulo it, according to Mathematica 11.3.0.0. So we can use this polynomial to construct the field extension, and then use that as the finite field that we'll do our polynomial transform, or DFT, over.
Now, modulo a suitable smaller prime, the polynomial factors completely. So our three roots exist equivalently as integer values modulo that prime.
Similar situations occur for several other primes $q_i$. Their product is greater than the maximum coefficient value, so that they can represent any value of the DFT (this is true as long as only one multiplication on each coefficient is performed). Surely we can perform a DFT modulo each of these primes $q$ in time $O(q^2)$. This also means that we only need to multiply each coefficient by one power of the root $\omega$, where our DFT outputs are $y_i$, our inputs are $x_j$, and
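This kind of prime search can be sketched as follows; the choice $a = 2$ and the search range are placeholders for illustration, not the paper's actual parameters.

```python
# Hypothetical sketch: find primes q for which x^3 - a splits completely,
# i.e. a has three distinct cube roots modulo q (a = 2 is a placeholder).
def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def cube_roots(a, q):
    return [x for x in range(q) if pow(x, 3, q) == a % q]

a = 2
good = [q for q in range(5, 200) if is_prime(q) and len(cube_roots(a, q)) == 3]
print(good)   # the primes modulo which x^3 - 2 is completely reducible
```

Only primes $q \equiv 1 \pmod 3$ can appear in the list, since modulo the other primes cubing is a bijection and $a$ has exactly one cube root.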
(2.2) $y_i = \sum_{j=0}^{n-1} \omega^{ij} x_j$
Thus, if we sum the squares of our primes $q_i$, the hope is that the total running time of all of our miniature DFTs is less than the time of the full-size polynomial transform. This is indeed the case.
2.2 An Easier Example
The polynomial that contains all of the cube roots of unity is $x^3 - 1$, so we already know the roots of the polynomial, and we can quickly and easily match the roots together. Thus, once we find one primitive cube root of unity $\omega$ modulo $q$, we can easily find $\omega^2$ modulo $q$.
3 Running Time Analysis
A sketch of the algorithm is shown in Figure 1.
We can examine the preprocessing first. Our first task is to find the prime $p$. For this we can use the sieve of Atkin and Bernstein [1]. It takes $O(N/\log\log N)$ arithmetic operations to sieve the integers up to $N$, and each operation takes constant time here, since we don't use any large numbers for this task. Since we'll use a transform of comparable size, this can certainly be done in time linear in the transform size.
After we have found $p$, we pick one value $a$ for our cube root, and this gives us the polynomial $x^3 - a$. We have to ensure that a cube root of $a$ does not exist in $\mathbb{F}_p$ already. So we cycle through values of $a$ until we find one whose cube root does not exist already. This happens roughly $2/3$ of the time for each choice of $a$ when $p \equiv 1 \pmod 3$. In fact, we must ensure that $p \equiv 1 \pmod 3$, because if $p \equiv 2 \pmod 3$, then every element in this field has a cube root. We can see this through Fermat's little theorem. We have
(3.1) $a^{p-1} \equiv 1 \pmod{p}$
(3.2) $a^{p} \equiv a \pmod{p}$
(3.3) $a^{2(p-1)} \equiv 1 \pmod{p}$
(3.4) $a^{2p-1} \equiv a \pmod{p}$
where the first two equations are Fermat's little theorem, and the last two follow from multiplying each side of the first two together. Since $p \equiv 2 \pmod 3$, the exponent $(2p-1)/3$ is an integer; if we then substitute $b$ for $a^{(2p-1)/3}$, we get
(3.5) $b \equiv a^{(2p-1)/3} \pmod{p}$
(3.6) $b^3 \equiv a^{2p-1} \pmod{p}$
(3.7) $b^3 \equiv a \pmod{p}$
This is taken from [3].
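The derivation above yields an explicit formula: when $p \equiv 2 \pmod 3$, the value $a^{(2p-1)/3}$ is a cube root of $a$. A quick check with the toy prime $p = 11$ (an illustrative choice):

```python
# Sketch of the cube-root construction: when p = 2 (mod 3), every element a
# of Z/pZ has the explicit cube root a^((2p-1)/3) mod p.
p = 11                               # 11 = 2 (mod 3), a toy prime
assert p % 3 == 2
e = (2 * p - 1) // 3                 # exponent (2p-1)/3 = 7, an integer
for a in range(p):
    root = pow(a, e, p)
    assert pow(root, 3, p) == a      # every element really has a cube root
print("all", p, "residues have cube roots mod", p)
```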
We can find the various cube roots modulo a prime $q$ by cubing each value from $1$ to $q-1$ modulo $q$, which takes 2 multiplications per value, and thus time $O(q)$ per prime $q$. Since we're only interested in storing the two cube roots, this takes $O(1)$ space. Here, we simply want to find two cube roots that do not exist modulo $p$, so that we can use them as field extensions modulo $p$. They should both exist modulo all the other primes that we use, so that we can use linear algebra on them (as described above). For all the other primes, we can start with primes close to our target size and proceed towards smaller and smaller primes. The probabilities that values modulo some prime $q$ are cubes are roughly equal, according to [11]. This assures us that we can find a cube root with probability roughly $1/3$ per candidate. Also, if we use a circulant matrix, which has the form given in Figure 1, where all diagonal entries are the same, the probability that the matrix created from the corresponding cube roots is singular is small. Thus we should be able to find a set of usable primes fairly easily, since the number of primes is constant. This follows because they are all fairly close to one another in size. Also, the $3 \times 3$ circulant matrix with first row $(a, b, c)$ is known to have determinant
(3.8) $a^3 + b^3 + c^3 - 3abc$
This determinant will equal $0$ modulo $q$ only a small fraction of the time, since the entries or coefficients of the matrix are again evenly distributed. Thus as $q$ goes to infinity, so do all of the potential values that we use, and thus the probability that the matrices are singular becomes smaller; namely, it approaches $0$.
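The singularity claim can be checked by brute force for a small prime; $q = 31$ is an arbitrary toy choice, and the observed fraction is a small constant multiple of $1/q$.

```python
# Count singular 3x3 circulant matrices mod a toy prime q, using the
# determinant formula a^3 + b^3 + c^3 - 3abc for first row (a, b, c).
q = 31
singular = sum(1 for a in range(q) for b in range(q) for c in range(q)
               if (a**3 + b**3 + c**3 - 3*a*b*c) % q == 0)
print(singular / q**3)   # a small fraction, shrinking as q grows
```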
Now, to find the set of usable primes, note that this is equivalent to finding an integer that functions as the cube root of $a$ modulo $q$, and as we saw above, this takes time $O(q)$ per prime. Now, the asymptotic formula for the sum of the primes up to $x$, according to [9], is
(3.9) $\sum_{\substack{q \le x \\ q \text{ prime}}} q \sim \frac{x^2}{2 \ln x}$
This work is certainly done within our time bound.
Finally, we will address the smaller DFTs here and in the next subsection. For each prime $q$, we compute each individual power of the root to associate with the input coefficients and the rows and columns of each DFT matrix. This takes $O(q)$ operations per prime, which is certainly affordable, as each prime that we use will be less than the transform size.
3.1 The Algorithm
The first step of the algorithm is to collect all of the coefficients modulo each prime $q_i$. What we mean by this is shown in Figure 2 and Figure 3. Performing the DFT is equivalent to multiplying a Vandermonde matrix by a vector. This equivalent Vandermonde matrix is the matrix in Figure 2, with the nodes replaced by powers of $\omega$, where $\omega$ is an element with multiplicative order $m$, for an $m$-element DFT.
Now, if we take our $n$-element polynomial transform, which is essentially like an $n$-element DFT, then the matrix equivalent of the polynomial transform is the matrix in Figure 3. Further, modulo our primes $q_i$, for all $i$, this matrix becomes a Vandermonde matrix. Here we can get the correct Vandermonde matrix by adding together the coefficients that are matched with each element, for all elements. Note that the Vandermonde matrices (columns) may stop abruptly, immediately after the last coefficient.
Thus, our task for each $q_i$ is to sum the coefficients that fall into each column of our various Vandermonde matrices. We can easily put the first batch of values into a linked list, cycle back to the start of the list, add the second batch of values to the first, and cycle back again. In total, this takes at most $n$ additions. We do this for a constant number of primes, as we have previously shown, and each addition takes time at most logarithmic in the modulus. Thus the total time for this operation is $O(n \log n)$.
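The folding loop can be sketched as follows; the sizes $n = 20$ and $q = 7$, and the simple $i \bmod q$ matching of coefficients to columns, are illustrative assumptions.

```python
# Sketch of the folding step (hypothetical sizes): the n input coefficients
# are summed cyclically into q buckets, one bucket per column of the
# Vandermonde matrix modulo a small prime q.
n, q = 20, 7
coeffs = list(range(n))        # placeholder input coefficients

buckets = [0] * q
for i, c in enumerate(coeffs):
    buckets[i % q] += c        # cycle back to the start of the list
print(buckets)
```

Note that the last pass through the list stops abruptly after the final coefficient, exactly as the truncated Vandermonde columns do.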
The DFT is next calculated modulo each prime $q_i$, on the folded outputs of the Vandermonde matrices. This takes $O(q^2)$ operations for each prime $q$, and according to [8], sums of squares of primes grow as
(3.10) $\sum_{\substack{q \le x \\ q \text{ prime}}} q^2 \sim \frac{x^3}{3 \ln x}$
And so we can do this in relatively few multiplications. We'll handle this with recursion soon. Presently we will say that, with a constant number of primes of comparable size, we can easily use the Chinese remainder theorem on our DFT outputs modulo each $q_j$. This will take a constant number of multiplications per output. Also, we have to cycle through our lists of DFT outputs modulo each $q_j$ more than once, but this won't affect our running time.
We do this fairly easily, as we will see. Set $y_{i,j}$ to be output number $i$ for the DFT done modulo $q_j$. Then output $i$ of the entire polynomial transform is easily seen to be the value $y_i$ satisfying $y_i \equiv y_{i,j} \pmod{q_j}$ for every $j$, which is
(3.12) $Q = \prod_j q_j$
(3.13) $M_j = Q / q_j$
(3.14) $N_j = M_j^{-1} \bmod q_j$
(3.15) $y_i \equiv \sum_j y_{i,j} M_j N_j \pmod{Q}$
for all primes $q_j$.
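A minimal sketch of this recombination, with made-up primes and stand-in DFT outputs:

```python
# Hypothetical recombination sketch: y[i][j] is output i of the small DFT
# done modulo q_j; the Chinese remainder theorem merges the residues into
# output i modulo the product Q of the primes.
from math import prod

primes = [11, 13, 17]
Q = prod(primes)                         # 2431
true_outputs = [123, 2020, 999]          # stand-in exact DFT outputs, all < Q
y = [[t % q for q in primes] for t in true_outputs]

merged = []
for residues in y:
    total = 0
    for q, r in zip(primes, residues):
        M = Q // q                       # product of the other primes
        total += r * M * pow(M, -1, q)   # standard CRT term
    merged.append(total % Q)
print(merged)                            # [123, 2020, 999]
```

Recovery is exact precisely because each stand-in output is smaller than $Q$, mirroring the coefficient bound discussed earlier.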
Now we can handle the recursion. We will say that the initial multiplication has $n$ bits, and we write a recurrence that accounts for the recursive multiplications of smaller size. This recursion is tracked very closely by the iterated logarithm function, yielding our running time and space bounds, for some undetermined constant. This is our final running time. We note that the space usage can probably be reduced substantially.
4 Correctness
According to [2], for performing the FFT in a ring, for instance modulo $m$, it suffices to use a principal root of unity $\omega$, provided the transform length $N$ has a reciprocal in this ring. The criterion is that the $N$th principal root of unity $\omega$ is such that:
(4.1) $\omega^N = 1$
(4.2) $\sum_{i=0}^{N-1} \omega^{ij} = 0 \quad \text{for } 1 \le j < N$
This, of course, carries over to the DFT and hence the polynomial transform. Also, every finite field is in particular a ring, and so we can surely find a principal root of unity that possesses this property.
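The two conditions can be verified directly for a toy ring; $p = 17$, $N = 4$, and $\omega = 4$ are illustrative values.

```python
# Check of the two principal-root conditions in Z/17Z for the candidate
# omega = 4, which has multiplicative order 4 (4^2 = 16 = -1, 4^4 = 1).
p, N, omega = 17, 4, 4
assert pow(omega, N, p) == 1                           # condition (4.1)
for j in range(1, N):
    s = sum(pow(omega, i * j, p) for i in range(N)) % p
    assert s == 0                                      # condition (4.2)
print(omega, "is a principal", N, "th root of unity mod", p)
```

Here $N = 4$ also has a reciprocal modulo 17, so the inverse transform exists as required.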
5 Acknowledgments
This paper wouldn’t be possible without the help of people on StackExchange.com and the math software Mathematica . In particular, I’d like to thank Jyrki Lahtonen and David Harvey for their help.
References
 [1] A.O.L. Atkin and D.J. Bernstein. Prime sieves using binary quadratic forms. Mathematics of Computation, 73(246):1023–1030, December 2003.
 [2] Dario Bini and Victor Pan. Fundamental Algorithms, volume 1 of Polynomial and Matrix Computations. Birkhäuser, 1994.
 [3] Wikipedia contributors. Cubic reciprocity — wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Cubic_reciprocity, March 2019.
 [4] James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19:297–301, 1965.
 [5] Martin Fürer. Faster integer multiplication. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing (STOC), pages 57–66, June 2007. MR 2402428 (2009e:68124).

 [6] Haitham Hassanieh, Piotr Indyk, Dina Katabi, and Eric Price. Nearly optimal sparse Fourier transform, 2012.
 [7] David Harvey and Joris van der Hoeven. Integer multiplication in time O(n log n). https://hal.archives-ouvertes.fr/hal-02070778, March 2019.
 [8] Eric Naslund (https://math.stackexchange.com/users/6075/eric-naslund). How does grow asymptotically for ? Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/49434 (version: 20170413).
 [9] Marty Cohen (https://math.stackexchange.com/users/13079/marty-cohen). Calculate the limit (here p is a prime). Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/429219 (version: 20130625).
 [10] Peter Bürgisser, Michael Clausen, and Mohammed Amin Shokrollahi. Algebraic Complexity Theory, volume 315. Springer, 1997.
 [11] reuns (https://math.stackexchange.com/users/276986/reuns). Are cube roots evenly distributed modulo primes? Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/3459391 (version: 20191202).
 [12] Arnold Schönhage and Volker Strassen. Schnelle Multiplikation grosser Zahlen. Computing, 7:281–292, 1971.