# The Polynomial Transform

We explore a new form of DFT, which we call the Polynomial Transform. It functions over finite fields, computing DFTs of size $m^3$ for $m$ prime and of the form $3\ell+1$ for some natural number $\ell$. It uses only $O(m^3)$ arithmetic operations on numbers of size $O(\log m)$, and takes total time $NK\log N$, for $N = \Theta(m^3 \log m)$.


## 1 Introduction

Our algorithm, like many multiplication algorithms before it, relies on the DFT; the DFT is the major bottleneck in multiplication algorithms due to its running time.

The DFT is a very useful algorithm. With the popularization of the FFT by Cooley and Tukey in 1965 [4], it became much more widely used. Since then, there have been a few attempts at speeding up the DFT. In 2012 there was a major breakthrough with Hassanieh, Indyk, Katabi, and Price's sparse FFT (sFFT) [6], which gave a time of $O(k \log n)$ for exactly $k$-sparse signals and $O(k \log n \log(n/k))$ for the general case. One drawback of the algorithm is that the transform size $n$ needs to be a power of 2.

In this paper, we present a DFT algorithm that uses only $O(m^3)$ operations for a transform of size $m^3$. Like some forms of the DFT, it only works for certain sizes, although it is much less limited than many. We then use this transform to multiply two natural numbers.

We note that it was Karatsuba who, in 1962, first improved on the naïve bound of $O(n^2)$ to $O(n^{\log_2 3}) \approx O(n^{1.585})$.
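As a brief illustration (a standard textbook sketch, not code from this paper), Karatsuba's scheme achieves this bound by replacing four half-size products with three:

```python
def karatsuba(x: int, y: int) -> int:
    """Multiply two naturals with Karatsuba's divide-and-conquer scheme."""
    if x < 16 or y < 16:                      # small inputs: builtin multiply
        return x * y
    m = max(x.bit_length(), y.bit_length()) // 2
    hi_x, lo_x = x >> m, x & ((1 << m) - 1)   # split x = hi_x*2^m + lo_x
    hi_y, lo_y = y >> m, y & ((1 << m) - 1)
    z2 = karatsuba(hi_x, hi_y)                # only three recursive products
    z0 = karatsuba(lo_x, lo_y)
    z1 = karatsuba(hi_x + lo_x, hi_y + lo_y) - z2 - z0
    return (z2 << (2 * m)) + (z1 << m) + z0

assert karatsuba(123456789, 987654321) == 123456789 * 987654321
```

The recurrence $T(n) = 3T(n/2) + O(n)$ gives the $O(n^{\log_2 3})$ bound.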

Besides the straight-line program asymptotic limit [10], Schönhage and Strassen conjectured in 1971 that integer multiplication must take time $\Omega(n \log n)$ [12]. In that same paper, they set the world record at $O(n \log n \log\log n)$. That bound stood for almost 40 years, until Fürer obtained the new bound $O(n \log n \, K^{\log^* n})$, where $K$ is some constant and $\log^*$ is the iterated logarithm [5]. Finally, there were a number of results, particularly a number of impressive papers from Harvey and van der Hoeven, that culminated in their $O(n \log n)$-time algorithm [7].


## 2 Polynomial Transform Overview

We use the multitape Turing machine as our model of computation.

The basic idea behind the transform is to convert what is essentially one large DFT into many DFTs that are much smaller.

We start with numbers in the field $\mathbb{F}_p$, which is to say that we will be working with integers modulo $p$. The next step is to extend this field, which means that we will use the roots of a polynomial similarly to how the imaginary number $i$ is used in complex numbers. The key idea is that we will use a polynomial of degree 3 (or possibly higher degree), so that the system will have at least 3 coefficients or components modulo $p$.

To build this extension, we need a polynomial that is irreducible modulo $p$. Then, the trick is to find other primes $q_k$, for varying $k$, where the same polynomial is completely reducible; in other words, it factors into three linear factors. The idea is that these roots then become ordinary "numbers", so that our 3-component or 3-coefficient system becomes a single number modulo the various $q_k$. We'll call this component reduction.

We can demonstrate this component reduction with complex numbers. Modulo 5, we can take the imaginary number $i$, whose square is $-1$. But $2^2 = 4 \equiv -1 \pmod 5$, and $3^2 = 9 \equiv -1 \pmod 5$. So the imaginary number, modulo 5, reduces to either 2 or 3. The idea is then to use both versions, and use linear algebra to find the components modulo 5.

An example is the number $i = 0 + 1 \cdot i$. First, setting $i = 2$ yields the value 2. Then, setting $i = 3$ yields 3. Now we use linear algebra (modulo 5) to recover the real and imaginary components $a$ and $b$:

$$\begin{bmatrix} 1 & 2 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} \equiv \begin{bmatrix} 2 \\ 3 \end{bmatrix} \pmod 5 \qquad (2.1)$$

Solving this system, we recover real coefficient $a = 0$ and imaginary coefficient $b = 1$.
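As a minimal sketch (assuming the modulus 5 and the roots 2 and 3 from the example above), the reduction and the linear-algebra recovery can be checked in a few lines:

```python
# Illustration of "component reduction" modulo 5:
# i behaves like 2 or 3 mod 5, since 2*2 % 5 == 4 == -1 % 5 and 3*3 % 5 == 4.
p = 5
roots = [2, 3]                       # the two "versions" of i modulo 5
a, b = 0, 1                          # the complex number a + b*i = i

# Reduce the 2-component number to one residue per root.
reduced = [(a + b * r) % p for r in roots]          # gives [2, 3]

# Recover (a, b) by solving [[1, r0], [1, r1]] [a b]^T = reduced (mod 5).
r0, r1 = roots
det = (r1 - r0) % p                                 # determinant of the system
det_inv = pow(det, -1, p)                           # modular inverse (Py 3.8+)
b_rec = ((reduced[1] - reduced[0]) * det_inv) % p
a_rec = (reduced[0] - r0 * b_rec) % p
print(a_rec, b_rec)                                 # → 0 1
```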

The idea is essentially the same for the polynomial transform. We use 3 irreducible roots modulo our modulus $p$, which allows a system of 3 components or coefficients, using a polynomial of degree at least 3. Then we find primes $q_k$, for varying $k$, where the system of vectors, components, coefficients, or whatever you want to call them, reduces to a single value modulo $q_k$. Thus, the system will be much smaller modulo $q_k$, since everything reduces to one number: each element that has 3 components modulo $p$ becomes a single-number element modulo $q_k$.

Now, since we find a few different primes $q_k$ to use, we can use the Chinese remainder theorem on the coefficients modulo the $q_k$ to recover them modulo $p$, if we are very careful. The idea is that a 3-component system has a maximum value, from one multiplication modulo $p$, of $3(p-1)^2$ for each coefficient. This comes from the fact that each coefficient is at most $p-1$, so that when we multiply 2 coefficients together, we get at most $(p-1)^2$ as the value, and since there are 3 different coefficients, we multiply this by 3. We are not done, because to convert these 3 components into the reduced single value modulo $q_k$, we must multiply each again by a root value of size at most $q_k - 1$ and then sum the coefficients, to get a total value of size at most $9(p-1)^2(q_k - 1)$. That gives a single coefficient. Now, to be safe, we want our system to handle all of the coefficients summed into one value; if there are $n$ coefficients, this means that we have $n$ values of size at most $9(p-1)^2(q_k - 1)$ summed together into one coefficient modulo $q_k$, so that our coefficients are at most $9n(p-1)^2(q_k - 1)$.

Now, we can use the Chinese remainder theorem to ensure that we can recover our coefficients, if the primes, when combined together, can handle an exact value of at most this maximum. In other words, we want the product $\prod_k q_k$ to exceed the maximum possible coefficient value.
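The recovery step rests on the standard Chinese remainder theorem; a minimal sketch (with toy moduli chosen for illustration, not values from the paper) shows exact recovery whenever the product of the primes exceeds the value:

```python
# Recover an exact integer from its residues modulo several coprime primes,
# provided the product of the primes exceeds the value.
def crt(residues, moduli):
    """Combine residues via the Chinese remainder theorem."""
    M = 1
    for q in moduli:
        M *= q
    x = 0
    for r, q in zip(residues, moduli):
        Mq = M // q
        x += r * Mq * pow(Mq, -1, q)     # pow(a, -1, q): modular inverse
    return x % M

primes = [101, 103, 107]                  # product 1113121 bounds the value
value = 987654                            # any value below the product
residues = [value % q for q in primes]
assert crt(residues, primes) == value
```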

We note that the order of the roots that we use does not matter when we combine our matrices: the values of the roots are permuted, but they will still give the same result. However, we know which root plays which role, so that we can technically order them.


### 2.1 An Example

We can demonstrate the idea by showing how we'd perform a polynomial transform. We first note that our modulus is prime, and also that our chosen polynomial is irreducible modulo it, according to Mathematica 11.3.0.0. So we can use this polynomial to construct the extension field, and then use that as the finite field that we'll do our polynomial transform, or DFT, over.

Now, modulo the first small prime we choose, the polynomial factors completely. So our three roots exist equivalently as integer values modulo that prime.

Similar situations occur for several other primes. Their product is greater than the maximum coefficient value, so that they can represent any value of the DFT (this is true as long as only one multiplication on each coefficient is performed). Surely we can perform a DFT modulo each of these primes in time $O(q_k^2)$. This also means that we only need to multiply each coefficient by one power of $\omega$, where our output of the DFT is $\hat{z}$, our inputs are $z_j$, and

$$\hat{z}_k = \sum_{j=1}^{n} z_j \omega^{jk} \qquad (2.2)$$

Thus, if we sum the squares of our primes $q_k$, the hope is that the total running time of all of our miniature DFTs is less than the time of the full-size polynomial transform. This is indeed the case.
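A direct modular DFT matching equation (2.2) (with 0-based indices, and an illustrative prime and root rather than values from the paper) can be sketched as:

```python
# Naive DFT over the integers modulo a prime q:
#   z_hat[k] = sum_j z[j] * omega^(j*k)  (mod q)
# q = 7 and omega = 3 are illustrative: 3 has multiplicative order 6 mod 7.
def dft_mod(z, omega, q):
    n = len(z)
    return [sum(z[j] * pow(omega, j * k, q) for j in range(n)) % q
            for k in range(n)]

q, omega = 7, 3                  # omega generates the order-6 group mod 7
z = [1, 2, 3, 4, 5, 6]
z_hat = dft_mod(z, omega, q)

# Inverse: apply the DFT with omega^{-1} and divide by n; this recovers z.
omega_inv = pow(omega, -1, q)
n_inv = pow(len(z), -1, q)
z_back = [(n_inv * v) % q for v in dft_mod(z_hat, omega_inv, q)]
assert z_back == z
```

The double loop makes the $O(q_k^2)$ cost per small prime explicit.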


### 2.2 An Easier Example

The polynomial that contains all of the cube roots of 1 is $x^3 - 1$, so that we already know the roots of the polynomial, and we can quickly and easily match the roots together. Thus, once we find one nontrivial cube root of 1 modulo $q_k$, we can easily find the other modulo $q_k$.
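For instance (with an illustrative prime, not one from the paper): since $x^3 - 1 = (x-1)(x^2 + x + 1)$, if $r$ is a nontrivial cube root of 1 modulo $q$, the remaining root is simply $r^2$:

```python
# Find the cube roots of 1 modulo an illustrative prime q ≡ 1 (mod 3),
# and check that they are exactly {1, r, r^2} for a nontrivial root r.
q = 13                                        # 13 ≡ 1 (mod 3), so 3 roots exist
roots = [x for x in range(1, q) if pow(x, 3, q) == 1]
r = next(x for x in roots if x != 1)          # a nontrivial cube root of 1
assert sorted(roots) == sorted({1, r, (r * r) % q})
print(roots)                                  # → [1, 3, 9]
```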


## 3 Running Time Analysis

A sketch of the algorithm is shown in Figure 1.

We can examine the preprocessing first. Our first task is to find the prime $p$. For this we can use the sieve of Atkin and Bernstein [1]. It takes $O(N/\log\log N)$ arithmetic operations, assuming that each operation takes time $O(\log N)$, since we don't use any numbers larger than $O(\log N)$ bits for this task. Since we'll use a transform size of $\Theta(m^3)$, this can certainly be done in time linear in the transform size.
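A simple sieve (Eratosthenes here, as a stand-in for the faster Atkin–Bernstein sieve the paper uses) suffices to list candidate primes of the form $3\ell + 1$:

```python
# Sieve of Eratosthenes as a simple stand-in for the Atkin–Bernstein sieve:
# find primes of the form 3*l + 1 (candidates for the modulus p).
def primes_1_mod_3(limit):
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    for d in range(2, int(limit ** 0.5) + 1):
        if is_prime[d]:
            for mult in range(d * d, limit + 1, d):
                is_prime[mult] = False
    return [n for n in range(2, limit + 1) if is_prime[n] and n % 3 == 1]

print(primes_1_mod_3(50))        # → [7, 13, 19, 31, 37, 43]
```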

After we have found $p$, we pick one value for our cube root, and this gives us our extension. We have to ensure that this root does not already exist in $\mathbb{F}_p$, so we cycle through values until we find one that is not already a cube; this happens roughly $2/3$ of the time for each choice. In fact, we must ensure that $q_k \equiv 1 \pmod 3$, because if $q_k \equiv 2 \pmod 3$, then every element in this field has a cube root. We can see this through Fermat's little theorem. We have

$$\begin{aligned} y^{q_k} &\equiv x \pmod{q_k} & (3.1)\\ y^{q_k - 1} &\equiv 1 \pmod{q_k} & (3.2)\\ y^{q_k}\, y^{q_k - 1} &\equiv x \pmod{q_k} & (3.3)\\ y^{2q_k - 1} &\equiv x \pmod{q_k} & (3.4) \end{aligned}$$

where the first two equations are Fermat's little theorem, and the last two follow from multiplying the first two together side by side. If we then substitute $3r + 2$ for $q_k$, we get

$$\begin{aligned} y^{2q_k - 1} &\equiv x \pmod{q_k} & (3.5)\\ y^{2(3r+2) - 1} &\equiv x \pmod{q_k} & (3.6)\\ y^{6r + 3} &\equiv x \pmod{q_k} & (3.7) \end{aligned}$$

This is taken from [3].
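Since $6r + 3$ is divisible by 3, the identity above says that when $q \equiv 2 \pmod 3$, every $x$ has the explicit cube root $x^{(2q-1)/3} \bmod q$. A quick check (with illustrative primes):

```python
# When q ≡ 2 (mod 3), every residue has the explicit cube root x^((2q-1)/3).
# q = 11 and q = 17 are illustrative primes congruent to 2 mod 3.
for q in (11, 17):
    e = (2 * q - 1) // 3                 # an integer, since q ≡ 2 (mod 3)
    for x in range(q):
        root = pow(x, e, q)
        assert pow(root, 3, q) == x      # root^3 ≡ x for every x
```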

We can find the various cube roots by cubing each value from 1 to $q_k - 1$ modulo $q_k$, which takes 2 multiplications per value, and thus $O(q_k)$ time per prime $q_k$. Since we're only interested in storing the two cube roots we need, this takes $O(1)$ space per prime. Here, we simply want to find two cube roots that do not exist modulo $p$, so that we can use them as field extensions modulo $p$. They should both exist modulo all primes that we use other than $p$, so that we can use linear algebra on them (as described above). For all the other primes, we can start with primes close to $p$ and proceed towards smaller and smaller primes. The probabilities that values modulo some prime $q_k$ are cubes are roughly equal, according to [11]. This ensures that we can find a cube root with probability roughly $1/3$. Also, if we use a circulant matrix, which has the form given in Figure 1, where all diagonal entries are the same, the probability that the matrix created from the corresponding cube roots is singular is small. Thus we should be able to find a set of usable primes fairly easily, since the number of primes is constant. This follows because they are all fairly close to $p$. Also, the circulant matrix is known to have determinant

$$\prod_{j=0}^{q_k - 1} \sum_{l=0}^{q_k - 1} c_l\, \omega^{jl} \bmod q_k, \qquad \text{where } \omega = e^{2\pi i/(q_k - 1)} \qquad (3.8)$$

This equation will equal 0 modulo $q_k$ only rarely, since the entries or coefficients of the matrix are again evenly distributed. Thus as $q_k$ goes to infinity, so do all of the potential values that we use, and thus the probability that the matrices are singular becomes smaller; namely, it approaches 0.
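For the $3 \times 3$ case relevant here, the circulant determinant also has the well-known closed form $a^3 + b^3 + c^3 - 3abc$, which we can check against a brute-force cofactor expansion (the entry values are illustrative):

```python
# Determinant of the 3x3 circulant matrix [[a, b, c], [c, a, b], [b, c, a]],
# checked against the closed form a^3 + b^3 + c^3 - 3abc.
def det3(m):
    """Brute-force 3x3 determinant by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

for (a, b, c) in [(2, 5, 7), (1, 1, 1), (3, 0, 4)]:
    circ = [[a, b, c], [c, a, b], [b, c, a]]
    assert det3(circ) == a**3 + b**3 + c**3 - 3 * a * b * c
```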

Now, to find the set of usable primes, we should note that this is equivalent to finding an integer that functions as the cube root modulo each $q_k$, and as we saw above, this takes $O(q_k)$ time per prime. Now, the asymptotic formula for the sum of the primes, according to [9], is

$$\sum_{\substack{q_k \le x \\ q_k\ \text{prime}}} q_k \sim \frac{x^2}{2 \ln x}$$

This is certainly done within the allotted time.

Finally, we will address the smaller DFTs here and in the next subsection. For each prime $q_k$, we compute each individual power of $\omega$ to associate with the input coefficients and the rows and columns of each DFT matrix. This takes $O(q_k)$ operations per prime, which is certainly acceptable, as each prime that we use will be less than $p$.


### 3.1 The Algorithm

The first step of the algorithm is to collect all of the coefficients modulo each prime $q_k$. What we mean by this is shown in Figure 2 and Figure 3. Performing the DFT is equivalent to multiplying a Vandermonde matrix by a vector. This equivalent Vandermonde matrix is the matrix in Figure 2, with its entries taken as powers of $\omega$, where $\omega$ is an element with multiplicative order $n$, for an $n$-element DFT.

Now, if we take our Polynomial Transform, which is essentially like a large DFT, then the matrix equivalent of the Polynomial Transform is the matrix in Figure 3. Further, modulo our primes $q_k$, for all $k$, this matrix becomes a Vandermonde matrix. Here we can get the correct Vandermonde matrix by adding together the coefficients that are matched with each element. Note that the Vandermonde matrices (columns) may stop abruptly, immediately after the last coefficient.

Thus, our task for each $q_k$ is to sum the coefficients that match into each column of our various Vandermonde matrices. We can easily put the first $q_k$ values into a linked list, and then cycle back to the start of the list. Then we add the next $q_k$ values to our first values, and cycle back again. In total, this takes time proportional to the number of coefficients. We do this for a constant number of primes, as we have previously shown, and each addition takes at most logarithmic time, so the total time for this operation is within our budget.
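The cyclic summation above amounts to folding a long coefficient vector into residue classes of the index; a minimal sketch (with illustrative sizes, not values from the paper):

```python
# Fold a long coefficient vector into q buckets by index modulo q:
# coefficient j is added into bucket j % q, mimicking the cyclic
# "linked list" summation described above.
def fold_mod(coeffs, q, p):
    buckets = [0] * q
    for j, c in enumerate(coeffs):
        buckets[j % q] = (buckets[j % q] + c) % p
    return buckets

p, q = 31, 5                       # illustrative modulus and small prime
coeffs = list(range(12))           # 12 coefficients folded into 5 buckets
print(fold_mod(coeffs, q, p))      # → [15, 18, 9, 11, 13]
```

Bucket $j$ holds the sum of `coeffs[j::q]` reduced modulo $p$, i.e. one column of the collapsed Vandermonde system.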

The DFT is next calculated modulo each prime: these are the outputs of the Vandermonde matrices modulo each prime $q_k$. This takes $O(q_k^2)$ operations for each prime $q_k$, and according to [8], this sums to

$$\sum_{\substack{q_k \le x \\ q_k\ \text{prime}}} q_k^2 \sim \frac{x^3}{3 \ln x}$$

And so we can do this in the stated number of multiplications. We'll handle this with recursion soon. Presently we will say that with a constant number of primes close to $p$, we can easily use the Chinese remainder theorem on our DFT outputs modulo each $q_k$. This will take a constant number of multiplications of numbers of size $O(\log p)$ bits. Also, we have to cycle through our lists of DFT outputs modulo each $q_k$ more than once, but this won't affect our running time.

We do this fairly easily, as we will see. Set $\hat{x}^{(k)}_{j'}$ to be output number $j'$ of the DFT done modulo $q_k$. Then the final result of the entire Polynomial Transform is easily seen to be $\hat{x}_j$, which is

$$\hat{x}_j = \hat{x}^{(1)}_{j \bmod q_1} + \hat{x}^{(2)}_{j \bmod q_2} + \hat{x}^{(3)}_{j \bmod q_3} + \cdots + \hat{x}^{(\alpha)}_{j \bmod q_\alpha} \qquad (3.12)$$

for all primes.
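The gathering step of equation (3.12) can be sketched as follows. This is a hedged illustration of the indexing only: the per-prime output lists below are stand-in values, not outputs of a real transform.

```python
# Output j of the full transform gathers output j % q_k from each small
# DFT's output list, cycling through each list more than once.
small_outputs = {5: [10, 11, 12, 13, 14],              # stand-in values
                 7: [20, 21, 22, 23, 24, 25, 26]}
n = 12
combined = [sum(outs[j % q] for q, outs in small_outputs.items())
            for j in range(n)]
print(combined[0], combined[1])    # → 30 32
```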

Now we can handle the recursion. We will say that the initial multiplication has $n$ bits, and we write the recursion so that the recursive multiplications of smaller size enter the equation. This recursion is bounded very closely by the iterated logarithm function, yielding the stated time and space bounds, for some undetermined constant. This is our final running time. We note that the space requirements can probably be reduced substantially.


## 4 Correctness

According to [2], for performing the FFT in a ring, for instance modulo $q_k$, it suffices to use a principal root of unity $\omega$, provided $n$ has a reciprocal in this ring. The criterion is that the $n$th principal root of unity $\omega$ is such that:

$$\omega^n = 1 \qquad (4.1)$$
$$\sum_{i=0}^{n-1} \omega^{ij} = 0, \qquad j = 1, 2, \ldots, n-1 \qquad (4.2)$$

This, of course, carries over to the DFT and hence the Polynomial Transform. Also, every finite field is in particular a ring, and so we can surely find a principal root of unity that possesses this property.
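Conditions (4.1) and (4.2) can be checked directly for a root in a modular ring (the modulus, size, and root here are illustrative choices, not values from the paper):

```python
# Check the principal-root-of-unity conditions (4.1) and (4.2) for omega mod q.
# q = 17, n = 4, omega = 4 are illustrative: 4 has multiplicative order 4 mod 17.
q, n, omega = 17, 4, 4
assert pow(omega, n, q) == 1                                      # (4.1)
for j in range(1, n):
    assert sum(pow(omega, i * j, q) for i in range(n)) % q == 0   # (4.2)
assert pow(n, -1, q)                     # n has a reciprocal mod q, as required
```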


## 5 Acknowledgments

This paper wouldn't be possible without the help of people on StackExchange.com and the math software Mathematica. In particular, I'd like to thank Jyrki Lahtonen and David Harvey for their help.

## References

• [1] A.O.L. Atkin and D.J. Bernstein. Prime sieves using binary quadratic forms. Mathematics of Computation, 73(246):1023–1030, December 2003.
• [2] Dario Bini and Victor Pan. Fundamental Algorithms, volume 1 of Polynomial and Matrix Computations. Birkhäuser, 1994.
• [3] Wikipedia contributors. Cubic reciprocity — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Cubic_reciprocity, March 2019.
• [4] James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19:297–301, 1965.
• [5] Martin Fürer. Faster integer multiplication. pages 57–66, June 2007. MR 2402428 (2009e:68124).
• [6] Haitham Hassanieh, Piotr Indyk, Dina Katabi, and Eric Price. Nearly optimal sparse Fourier transform, 2012.

• [7] David Harvey and Joris van der Hoeven. Integer multiplication in time o(n log n). https://hal.archives-ouvertes.fr/hal-02070778, March 2019.
• [8] Eric Naslund (https://math.stackexchange.com/users/6075/eric naslund). How does grow asymptotically for ? Mathematics Stack Exchange. URL:https://math.stackexchange.com/q/49434 (version: 2017-04-13).
• [9] Marty Cohen (https://math.stackexchange.com/users/13079/marty-cohen). Calculate the limit (here p is a prime). Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/429219 (version: 2013-06-25).
• [10] Peter Bürgisser, Michael Clausen, and Mohammed Amin Shokrollahi. Algebraic Complexity Theory, volume 315. Springer, 1997.
• [11] Are cube roots evenly distributed modulo primes? Mathematics Stack Exchange. URL:https://math.stackexchange.com/q/3459391 (version: 2019-12-02).
• [12] Arnold Schönhage and Volker Strassen. Schnelle Multiplikation grosser Zahlen. Computing, 7:281–292, 1971.