# Faster Integer Multiplication Using Preprocessing

A New Number Theoretic Transform(NTT), which is a form of FFT, is introduced, that is faster than FFTs. Also, a multiplication algorithm is introduced that uses this to perform integer multiplication faster than O(n log n). It uses preprocessing to achieve an upper bounds of (n log n/(log log n/ log log log n). Also, we explore the possibility of O(n) time multiplication via NTTs that require only O(n) operations, using preprocessing.

Comments

There are no comments yet.

## Authors

• 2 publications
11/22/2016

### Faster integer multiplication using plain vanilla FFT primes

Assuming a conjectural upper bound for the least prime in an arithmetic ...
11/05/2018

### Putting Fürer Algorithm into Practice with the BPAS Library

Fast algorithms for integer and polynomial multiplication play an import...
02/22/2018

### Faster integer multiplication using short lattice vectors

We prove that n-bit integers may be multiplied in O(n log n 4^log^* n)...
06/09/2018

### A fast algorithm for solving linearly recurrent sequences

We present an algorithm which computes the D^th term of a sequence satis...
12/03/2019

### The Polynomial Transform

We explore a new form of DFT, which we call the Polynomial Transform. It...
03/02/2017

### Faster truncated integer multiplication

We present new algorithms for computing the low n bits or the high n bit...
02/05/2019

### Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries

On common processors, integer multiplication is many times faster than i...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Our algorithm, like many of the multiplication algorithms before, relies on the DFT, and that DFT is now the major bottleneck in multiplication algorithms due to its’ time.

The DFT is a very useful algorithm. With the popularization of the FFT by Cooley and Tukey in 1965[6], it became much more widely used. Since then, there have been a few attempts at speeding up the DFT. In 2012 there was a major breakthrough with Hassanieh, Indyk, and Price’s sparse FFT (sFFT)[8], that gave a time of for exactly -sparse algorithms and for the general case. One drawback of the algorithm is that it needs to be a power of 2.

In this paper, we present a DFT algorithm that uses only in operations, but it must use preprocessing. Like some forms of the DFT, it only works for certain sizes, although it is much less limited than many. Then we use this to multiply two naturals in the same time. Essentially, we prove that integer multiplication can be done in time slightly faster than .

We note that it was Karatsuba, who in 1962, first improved on the naïve bound of to .

Besides the straight line program asymptotic limit [12], Schönhage and Strassen conjectured in 1971 that integer multiplication must take time [14]. In that same paper, they set the world record at . That bound stood for almost 40 years, until Fürer got the new bounds , where is some constant and is the iterated logarithm[7]. Finally, there were a number of results, particularly with a number of impressive papers from Harvey and van der Hoeven, that culminated in their time algorithm[9].

It should be noted that in this paper, the focus is on the DFT, so we use for the size of the DFT, for the size of the multiplication, and we do calculations modulo a prime .

[skipabove=12pt,skipbelow=0.5cm, leftline=false, rightline=false, bottomline=true, topline=false, linecolor=black, linewidth=2,innertopmargin=10pt,innerbottommargin=10pt]

## 2 Algorithm Overview

The algorithm gets its’ speed advantage from the ability to solve small DFTs in linear number of operations, via table lookup. The basic idea is then to use divide-and-conquer methodology to transform each size DFT into many DFTs of approximately size . After enough recursion, the algorithm reaches DFTs that are sufficiently small enough. The algorithm uses another table to convert each input coefficient into many coefficients of much smaller size, using the Chinese remainder theorem. Then, the algorithm uses a table, as mentioned in the beginning, to lookup the result of this much smaller DFT. It uses one more table to combine the smaller outputs into outputs of the original size, again via Chinese remainder theorem. The recursion is finished and the final result is output.

[skipabove=12pt,skipbelow=0.5cm, leftline=false, rightline=false, bottomline=true, topline=false, linecolor=black, linewidth=2,innertopmargin=10pt,innerbottommargin=10pt]

### 2.1 Recursion Primer

This section borrows liberally from [3].

We can write the DFT as the following formula:

 ˆxk=n−1∑j=0xjωjk (2.1)

Here the th input is , and the th output is . represents an th principal root of unity, which is necessary for the DFT to function correctly.

We can now rewrite , and rewrite our function to accommodate recursion. We set and . Then our original function becomes

 ˆxk =n−1∑j=0xjωjk (2.2) =n2−1∑j0=0n2−1∑j1=0xj1n1+j0jω(j1n1+j0)j(k2n2+k0)k (2.3) =n2−1∑j0=0ωj0k0twiddlefactors(n2−1∑j1=0xj1n1+j0ωn1j1k0)Inner DFTωn2j0k2 (2.4)

This shows how we can express one DFT as many DFTs of two particular sizes, namely and . Splitting up one transform into more than one transform (size) is known as the multidimensional DFT.

[skipabove=12pt,skipbelow=0.5cm, leftline=false, rightline=false, bottomline=true, topline=false, linecolor=black, linewidth=2,innertopmargin=10pt,innerbottommargin=10pt]

### 2.2 Tabular DFT

Currently, for our DFTs, we perform calculations in a finite field, and we will say that we do most calculations modulo a prime , with .. This is known as the number theoretic transform.

We will find that the size of the prime that we need to use will be , where is the actual size of the DFT, and is a constant known as Linnik’s constant. In order to use tables to find the DFT result more quickly, we want to reduce our calculations from modulo to modulo a much smaller value, say mod . This will ensure that we can multiply two bit numbers together to get a number less than , which means that we only need a table of entries in it to record all possible multiplications of numbers of this size. This will allow us to have a maximum of bits per entry, and therefor the table won’t be larger than total bits. We can then call our temporary “word” size for this algorithm as .

Now that we know the word size, the rest is more straightforward. First, to multiply two values and together modulo , we split them up into and and

 (a)⋅(b) = (2.5) (∑jaj(4√N)j)(∑kbk(4√N)k) = ∑j+k=ℓajbk(4√N)ℓ (2.6)

We can simplify Equation 2.6 by focusing on calculating . If we use a table for each from to , we can calculate any possible value we will need. This is because there are at most powers of in and , and their product requires, at most, double this number. Now this number is an exact value modulo , which again is . We record exactly this value, modulo , in our table.

Since we are adding together numbers to get our result, we can reduce them to a value, modulo , with comparisons. This is because the final number, itself, will be at most , due to adding results from multiplication together. So we compare this number to , and if is is greater, then we subtract . Next, compare to . If it is greater, subtract . We repeat this times, which is a constant. Thus we can multiply two values less than in a constant number of arithmetic operations.

[skipabove=12pt,skipbelow=0.5cm, leftline=false, rightline=false, bottomline=true, topline=false, linecolor=black, linewidth=2,innertopmargin=10pt,innerbottommargin=10pt]

### 2.3 The “Leaves”

At the innermost level of the recursion, the algorithm essentially converts coefficients modulo into much smaller coefficients, via Chinese remainder theorem. We’ll call the maximum (smaller) prime . The essential idea is that the algorithm has a coefficient of a DFT as a number, which we can write as

 c=x0N0⋅1/4+x1N1⋅1/4+⋯+x4L−1N(4L−1)⋅1/4 (2.7)

Using the same logic as Section 2.2, the algorithm will use tables, so that is assigned a specific value for each of the smaller primes equal to or less than . In other words, , for some small prime , is one and only one value. We can further assign one and only one value. Thus the algorithm can convert any coefficient into a system of primes via Chinese remainder theorem. This requires no calculation during the actual algorithm, only calculation during the preprocessing. Thus the time it takes is proportional to the size of the coefficient , assuming constant time to access any bit of memory.

[skipabove=12pt,skipbelow=0.5cm, leftline=false, rightline=false, bottomline=true, topline=false, linecolor=black, linewidth=2,innertopmargin=10pt,innerbottommargin=10pt]

## 3 Algorithm Details

So, essentially we start with a multiplication algorithm that uses a number theoretic transform, or NTT, which is essentially a DFT in a finite field. Our goal is to multiply two -bit naturals, and we will assume that is approximately equal to , with more details to be described soon. We used just to simplify the calculatios with . One important thing to note is that our algorithm will need some precalculated information, and so we can perhaps describe it better as a circuit.

Our first real step towards finding some of the parameters that the algorithm will use is to determine the size of the largest of our small primes , as noted above in Section 2.3, which is used when we break the coefficients modulo into coefficients modulo smaller primes such as . We use as the size of the maximum (smaller) prime. To start, we’ll assume that we can’t use any tables larger than in total size, so that we are guaranteed not to take up too much time or space. We now need to know how big the Chinese remainder theorem will allow us to make .

We know that we can use the product of the smallest primes to compute a value modulo one large prime by using the Chinese remainder theorem. Assuming that the smallest DFT that we use will have a size , then we need to calculate a value of size . This is because each output coefficient takes an input that can range from to , and multiplies it by another value, , raised to some power, modulo again. This gives us , which there are of these since we’ve assumed that the (small) DFT is of size . The primorial, , is defined to be the product of the first primes, and setting and fining will give us the bits in the largest of the smaller primes, . According to [13], we have for , as an upper bound on the primorial, that

 log(x#) >x(1−12logx) (3.1) >x1−o(1) (3.2)

This gives

 x1−o(1)>P2m (3.3)

We also know that (the log of) the sum of the first primes times gives us the bitsize of each entry of the tables, since we must have an entry for each prime times the number of entries, . We have a total of total entries,111Here Prime means the th prime. since there are at most entries for each prime in the range from to . We also know, from [15], that the sum of the first primes is

 x∑z=1Prime(z)=x22logx+O(x2logx2) (3.4)

Further, we have that for the sum of the th powers of a prime, we have a formula from [10]

 x∑z=1Prime(z)k=li(xk+1)+O(xk+1e−c√logx) (3.5)

Here “li” is the offset or Eulerian logarithmic integral, as defined in [4]

 li(x) =∫x2dulogu (3.6) =O(xlogx) (3.7)

Collecting functions, we have

 x∑z=1Prime(z)k =li(xk+1)+O(xk+1e−c√logx) (3.8) =O(xk+1logxk+1)+O(xk+1e−c√logx) (3.9)

From Equation 3.3

 x1−o(1) >P2m (3.10) x >P2m (3.11)

To get the size of all of our DFT tables, we set to be larger than the size of all of the tables, as demonstrated in Figure 1.

From here, we can convert the inequality to a limit, taken from [11]:

 2N/c>(P2m)m (3.30) limn→∞2n/c(log(n10m))m (3.31)

This limit is calculated in Figure 2, where we take the limit as approaches . When , it is clear that the limit goes to infinity. When , the subtracted term is greater than and the limit goes to .

[skipabove=12pt,skipbelow=0.5cm, leftline=false, rightline=false, bottomline=true, topline=false, linecolor=black, linewidth=2,innertopmargin=10pt,innerbottommargin=10pt]

### 3.1 Recursion Details

For a DFT of size , we will use recursive calls of size ; more specifically, as or . This enables us to get as close to as possible, but at the expense of only being able to use certain DFT sizes. Since we are always using as the size of the inductive step of the recursion, we will make the original DFT of size , where or is again the size of the base case DFT, and is the size of the original problem. It should be fairly obvious that as grows larger, gets relatively closer to . It was already shown how grows as grows, for example, in Equation 3.25.

In addition to changing the size of as varies, we can also set . This just simply gives us more precise control of the size of , since the ratio between similar sizes of is reduced to approximately .

We next want to show how the speed of the base case affects the speed of the overall algorithm. Ordinarily, a size FFT takes time , but our base case takes linear time, or . For a size algorithm, there are approximately DFTs of size . So this takes time . For , this takes time

 2√m4(⋅2m⋅mlogm/logm) = (3.32) 2m2⋅2m⋅mlogm/logm = (3.33) 4m4logm/logm = (3.34) m4(4logm)/logm = (3.35) m4(logm4)/logm (3.36)

In general, the algorithm takes time . We’ve already shown this for the base cases and . So to prove this we use the induction. Let the sized DFT (our algorithm) take time . Then, we have for a size algorithm

 2√N2(time for N FFT) = (3.37) 2N(⋅NlogN/logm) = (3.38) 2m2k(m2klogm2k/logm) = (3.39) 2m2k+1logm2k/logm = (3.40) m2k+12logm2k/logm = (3.41) m2k+1logm2k+1/logm (3.42)

[skipabove=12pt,skipbelow=0.5cm, leftline=false, rightline=false, bottomline=true, topline=false, linecolor=black, linewidth=2,innertopmargin=10pt,innerbottommargin=10pt]

### 3.2 Base Case Details

The algorithm will proceed to use recursion with a base case of size , and we have already shown how large can get before the table sizes are too large (i.e. bigger than ). However, we want for some and . So first we find . That is to say, we find the appropriate power of such that is approximately equal to , our total transform size. This is a simple, but time consuming process, so that we do it during preprocessing. Then, we find such that . This is easy to find, since we pick .

After finding the appropriate , we do some fine-tuning. We know that , so we pick a new so that . Finally, we use binary search to search for such that

 (m2+1)2k−zm2z≤N≤(m2+1)(2k−z)+1m2z−1 (3.43)

This ensures that we’re within a factor of , as previously stated on page 3.1 in Section 3.1. Again, is slightly less than , but is certainly greater than , since that is the size of the next level of recursion.222This is because at each level of the recursion, we set, as the recursion size, almost exactly as the size of the next recursion. This ensures that the running time of teh smallest DFT is not significantly affected by .

[skipabove=12pt,skipbelow=0.5cm, leftline=false, rightline=false, bottomline=true, topline=false, linecolor=black, linewidth=2,innertopmargin=10pt,innerbottommargin=10pt]

### 3.3 Regarding P and N

Although we will use lookup tables of quartic roots of , we need a larger prime to do our modular calculations. To find out what

is, we will use preprocessing. However, to estimate the size of

, we make use of an analytic number theory theorem, called Linnik’s theorem.

Essentially, we know that we want to use tables of size . However, we need a prime that is at least as big as . We can start with Euler’s totient function, . It is well known that the multiplicative order of any value modulo divides . The multiplicative order of a number is essentially the smallest power of a number, modulo some prime, that is equal to . We also know that divides for a prime . Putting all of this together, we want to find such that divides .

To find this , we can therefore examine the arithmetic progression , for fixed and varying. We want this number to be prime, in which case we will use it as , and find a value modulo that has multiplicative order , which we’ll use as . This is where Linnik’s theorem is required. It states that any arithmetic progression where is fixed will have a prime such that , where is Linnik’s constant. The current best bounds on is 5, [16], but if the Generalized Riemann hypothesis is true it is [17].

[skipabove=12pt,skipbelow=0.5cm, leftline=false, rightline=false, bottomline=true, topline=false, linecolor=black, linewidth=2,innertopmargin=10pt,innerbottommargin=10pt]

## 4 Runtime Analysis

The major parts of the algorithm are shown in Algorithm 1.

We start by analyzing the preprocessing. To find a prime between and , we can use the sieve of Atkin and Bernstein[1]. It takes arithmetic operations to find all primes between and .

Next, we find an that has multiplicative order . This takes at most time , according to the following algorithm. First, we build a linked list of all numbers. Then we pick a number at random. We then repeatedly find powers of that number, erasing their presence from the list. If we find a number with multiplicative order when we arrive at , we are done. Otherwise, we continue and pick another number. Hopefully we arrive at one after iterations. If not, we either continue as long as we haven’t yet eliminated numbers, or if we have eliminated more than , we need to do more. In this case, we take the total multiplicative order of our base number we used, call it . We take the th power of this number, and this will then be . This all takes time .

However, if we guess at values for , our th root of unity, we can find an omega in randomized time , and even in time , where is the largest factor of , where is Euler’s totient function. The reason is that the discrete logarithm runs in time for many different algorithms, and it is well known that there are elements of maximum order modulo . According to [5], when combined with the Pohlig-Hellman algorithm, the running time for the discrete logarithm is , thus the second, faster runtime.

The table creation time is arithmetic operations for the multiplication tables for the usual DFT multiplications. This is because there are different inputs of size , times because the multiplication operation at most doubles the input words to yield output words.

We know the table creation time for the small prime at the leaves of the recursion is proportional to , since we’ve already established that the size of all of the tables to convert between and the small primes is less than . This also follows because all of the DFTs that are used are size , as was discussed in Section 3, and thus this will multiply the size of the tables by at most .

Thus, summing all of the preprocessing times together, we come up with an time for the preprocessing.

[skipabove=12pt,skipbelow=0.5cm, leftline=false, rightline=false, bottomline=true, topline=false, linecolor=black, linewidth=2,innertopmargin=10pt,innerbottommargin=10pt]

### 4.1 Algorithm Speed

We already have explored much of the time required to compute the DFT. We use recursion to reach a DFT of size , and that DFT takes operations, times at most . So the total time for the new DFT algorithm takes time . Now, according to Section 3.1, this makes the total time be at most arithmetic operations. This is almost the speed of the algorithm. Recall that we set . Thus we have operations, which each take time , so that we have time, and this is the speed of the algorithm.

That is to say, with our calculations for from Equation 3.25 in Section 3 on page 3.43, the time is

 ρlogρ/logm < (4.1) ρlogρ/(loglog2n/cW(p2log2n/c)) < (4.2) ρlogρ/⎛⎜ ⎜⎝loglog2ρ/cW(rhoL2log2ρ/c)⎞⎟ ⎟⎠ = (4.3) ρlogρ/⎛⎜ ⎜⎝loglog2ρ/cW(rho52log2ρ/c)⎞⎟ ⎟⎠ (4.4)

Using Mathematica again shows that:

 limρ→∞ρlogρloglog2ρ/cW(rho52log2ρ/c)=0 (4.5)

So we are assured that this bound is better than the previous bound. This makes the time be , which equals

[skipabove=12pt,skipbelow=0.5cm, leftline=false, rightline=false, bottomline=true, topline=false, linecolor=black, linewidth=2,innertopmargin=10pt,innerbottommargin=10pt]

## 5 Linear Time DFT/NTT Overview

Here we explore a conjecture that there exists linear time NTT’s and linear time multiplication algorithms. Our dream becomes the idea that if , , , etc. are sizes of NTT’s, that we can somehow break apart a size into much smaller component DFT’s, but without recursion.

We begin this section with some preliminaries for the NTT, which we already reviewed is a type of DFT. According to [2], for computations done in a ring, for instance modulo , it suffices to use a principal root of unity , if has a reciprocal in this ring. The criterion is that the th principal root of unity is such that:

 ωn=1 (5.1) n−1∑i=0ωij=0,i=1,2,3,…,n−1 (5.2)

We conjecture that we can find a ring that allows a linear time NTT. The idea is actually not very complicated. We start with some finite fields, or number systems modulo primes for varying . The main idea is that for each “base” field that we use, or in other words for each , we have . Here is an auxiliary term, and we are mainly interested in the fact that has the product of all the other fields as a factor. The reason behind this is to ensure that the sum in Equation 5.2 is always 0. This happens because a sum of elements that repeats times modulo is always 0. In other words, the field has characteristic . This actually solves the toughest part of the problem.

The next step is to use field extensions. We essentially find a finite field of elements for each . Each of these fields will allow a maximum multiplicative order of at most . This is because there is one zero element, and all of the other elements are generated by at least one element in the field. To find these fields, we simply find an irreducible polynomial of degree over each field , and this will give us our field extensions along with the way to multiply the elements of each field. We are guaranteed to be able to find an irreducible polynomial of degree over each field, so this guarantees that we can create these fields.

Now, the idea is to use at least 3 and hopefully 4 primes, that are close in magnitude to each other. For each of these primes, we create the extension fields that use irreducible roots, as described above, to the fields for 3 or 4 ’s. The reasoning behind this is that, whatever the size of the primes, we can perform a size DFT or NTT in time at most , no matter how the field factors.333Note that the FFT requires the field to factor “nicely”. Then, if these primes are less than , a DFT or NTT of size can be performed in time modulo that prime.

Now that we know how to perform the DFT’s or NTT’s, we can determine how to use 3 of them to perform a size DFT or NTT. We start with , our root of unity. We find a -th root of unity modulo each . Then, this will be the final value, taken modulo for each . Given this omega, we realize that we have a -th root of unity. We will hope to maximize this value.

But first, we have yet to find out how these values are principal roots of unity, and not simply roots of unity. If we pick our primes , for each , and carefully, we can solve this problem. Now it’s easy to pick an value so that , but the hard part is finding values so that Equation 5.2 is always equal to . To do this, we set , for some . In other words, we assure ourselves that the sum will always be over , for some . If we make our sum large enough, this will ensure that we sum over all possible values of times. By doing this, we ensure that we multiply the sum in Equation 5.2 by , modulo , which assures us that the sum will be .

Now, we can revert back to our previous methods for solving the DFT or NTT. We use tabular lookup to ensure that our coefficients and values are never larger than . This assures us that the multiplication tables aren’t larger than . We can thus split up multiplication as in our preprocessing algorithm. We split up our multiplications modulo into many coefficients not larger than . We use our tabular multiplication on these coefficients. Thus each multiplication modulo , done in this fashion, takes time .

Now, assuming that we can take an original multiplication problem of bits, and transform it into a DFT or NTT of coefficients, each of size , we can solve this problem in roughly time. This is very parameter-dependent, meaning it is tough to find the primes and number systems that will allow it.

[skipabove=12pt,skipbelow=0.5cm, leftline=false, rightline=false, bottomline=true, topline=false, linecolor=black, linewidth=2,innertopmargin=10pt,innerbottommargin=10pt]

### 5.1 Example O(n) Ntt

Through brute-force search, it is found that for primes of sizes , , and will admit a solution.

Then, we proceed similarly for and . It is hoped that this will allow for linear time NTTs, albeit with preprocessing. The preprocessing is simply to find the primes and irreducible roots. We can find the root of unity by testing all nonzero values, modulo , for maximum multiplicative order modulo , which takes time for values times , the time to determine the discrete logarithm and compare it to the maximum mulitplicative order. This will determine if we have found an value with maximum order that we can use.

[skipabove=12pt,skipbelow=0.5cm, leftline=false, rightline=false, bottomline=true, topline=false, linecolor=black, linewidth=2,innertopmargin=10pt,innerbottommargin=10pt]

### 5.2 Other possibilities

We can search for primes that allow this possibility by considering triples such that , , and . Even the possibility for quadruples exists. After the primes are found, we must find squares that do not exist modulo , , and . The final step is to ensure that the product of any two squares already exists modulo our set of primes.