# Faster integer multiplication using plain vanilla FFT primes

Assuming a conjectural upper bound for the least prime in an arithmetic progression, we show that n-bit integers may be multiplied in O(n log n 4^(log^* n)) bit operations.

## Authors

• 6 publications
• 4 publications
12/11/2017

### Faster integer and polynomial multiplication using cyclotomic coefficient rings

We present an algorithm that computes the product of two n-bit integers ...
11/17/2019

### Faster Integer Multiplication Using Preprocessing

A New Number Theoretic Transform(NTT), which is a form of FFT, is introd...
03/02/2017

### Faster truncated integer multiplication

We present new algorithms for computing the low n bits or the high n bit...
02/22/2018

### Faster integer multiplication using short lattice vectors

We prove that n-bit integers may be multiplied in O(n log n 4^log^* n)...
10/12/2020

### An exponent one-fifth algorithm for deterministic integer factorisation

Hittmeir recently presented a deterministic algorithm that provably comp...
11/16/2016

### Efficient Parallel Verification of Galois Field Multipliers

Galois field (GF) arithmetic is used to implement critical arithmetic co...
12/03/2019

### The Polynomial Transform

We explore a new form of DFT, which we call the Polynomial Transform. It...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

Let be the number of bit operations required to multiply two -bit integers in the deterministic multitape Turing model [22]. A decade ago, Fürer [8, 9] proved that

 M(n)=O(nlognKlog∗n) (1)

for some constant . Here denotes the iterated logarithm, that is,

where (iterated  times). Harvey, van der Hoeven and Lecerf [14] subsequently gave a related algorithm that achieves (1) with the explicit value , and more recently Harvey announced that one may achieve by applying new techniques for truncated integer multiplication [13].

There have been two proposals in the literature for algorithms that achieve the tighter bound

 M(n)=O(nlogn4log∗n) (2)

under plausible but unproved number-theoretic hypotheses. First, Harvey, van der Hoeven and Lecerf gave such an algorithm [14, §9] that depends on a slight weakening of the Lenstra–Pomerance–Wagstaff conjecture on the density of Mersenne primes, that is, primes of the form , where

is itself prime. Although this conjecture is backed by reasonable heuristics and some numerical evidence, it is problematic for several reasons. At the time of writing, only 49 Mersenne primes are known, the largest being

[27]. More significantly, it has not been established that there are infinitely many Mersenne primes. Such a statement seems to be well out of reach of contemporary number-theoretic methods.

A second conditional proof of (2) was given by Covanov and Thomé [6], this time assuming a conjecture on the density of certain generalised Fermat primes, namely, primes of the form . Again, although their unproved hypothesis is supported by heuristics and some numerical evidence, it is still unknown whether there are infinitely many primes of the desired form. It is a famous unsolved problem even to prove that there are infinitely many primes of the form , of which the above generalised Fermat primes are a special case.

As an aside, we mention that the unproved hypotheses in [14, §9] and [6] may both be expressed as statements about the cyclotomic polynomials occasionally taking prime values: for [14, §9] we have , and for [6] we have .

In this paper we give a new conditional proof of (2), which depends on the following hypothesis. Let denote the totient function. For relatively prime positive integers  and , let denote the least prime in the arithmetic progression , and put .

###### Hypothesis P.

We have as .

Our main result is as follows.

###### Theorem 1.

Assume Hypothesis P. Then there is an algorithm achieving (2).

The overall structure of the new algorithm is largely inherited from [14, §9]

. In particular, we retain the strategy of reducing a “long” DFT (discrete Fourier transform) to many “short” DFTs via the Cooley–Tukey method

[5], and then converting these back to convolution problems via Bluestein’s trick [2]. The main difference between the new algorithm and [14, §9] is the choice of coefficient ring. The algorithm of [14, §9] worked over , where is a Mersenne prime and . In the new algorithm we use instead the ring , where  is a prime of the form , for an appropriate choice of and . Hypothesis P guarantees that such primes exist (take and ). The key new ingredient is the observation that we may convert an integer product modulo to a polynomial product modulo , by splitting the integers into chunks of bits.

In software implementations of fast Fourier transforms (FFTs) over finite fields, such as Shoup’s NTL library [25], it is quite common to work over where is a prime of the form that fits into a single machine register. Such primes are sometimes called FFT primes; they are popular because it is possible to perform a radix-two FFT efficiently over with a large power-of-two transform length. Our Theorem 1 shows that such primes remain useful even in a theoretical sense as .

From a technical point of view, the new algorithm is considerably simpler than that of [14, §9]. The main reason for this is that we have complete freedom in our choice of  (the coefficient size), and in particular we may easily choose to be divisible by the desired chunk size. By contrast, the choice of in [14, §9] is dictated by the rather erratic distribution of Mersenne primes; this forces one to deal with the technical complication of splitting integers into chunks of ‘non-integral size’, which was handled in [14, §9] by adapting an idea of Crandall and Fagin [7].

We remark that the conditional algorithm of Covanov and Thomé [6] achieves by a rather different route. Instead of using Bluestein’s trick to handle the short transforms, the authors follow Fürer’s original strategy, which involves constructing a coefficient ring containing “fast” roots of unity. The algorithm of this paper, like the algorithms of [14], makes no use of such “fast” roots.

Let us briefly discuss the evidence in favour of Hypothesis P. The best unconditional bound for is currently Xylouris’s refinement of Linnik’s theorem, namely [30]. If is a prime power (the case of interest in this paper), one can obtain [4, Cor. 11]. Assuming the Generalised Riemann Hypothesis (GRH), one has [19]. All of these bounds are far too weak for our purposes.

The tighter bound in Hypothesis P was suggested by Heath–Brown [17, 18]

. It can be derived from the reasonable assumption that a randomly chosen integer in a given congruence class should be no more or less ‘likely’ to be prime than a random integer of the same size, after correcting the probabilities to take into account the divisors of

. A detailed discussion of this argument is given by Wagstaff [29], who also presents some supporting numerical evidence. In the other direction, Granville and Pomerance [11] have conjectured that . These questions have been revisited in a recent preprint of Li, Pratt and Shakan [21]; they give further numerical data, and propose the more precise conjecture that

 liminfqP(q)φ(q)log2q=1,limsupqP(q)φ(q)log2q=2.

The consensus thus seems to be that is the right order of magnitude for , although a proof is apparently still elusive.

For the purposes of this paper, there are several reasons that Hypothesis P is much more compelling than the conjectures required by [14, §9] and [6]. First, it is well known that there are infinitely many primes in any given congruence class, and we even know that asymptotically the primes are equidistributed among the congruence classes modulo . Second, one finds that, in practice, primes of the required type are extremely common. For example, we find that is prime for

 a=13,306,726,2647,3432,5682,5800,5916,6532,7737,8418,8913,9072,…

and there are still plenty of opportunities to hit primes before exhausting the possible values of up to about allowed by Hypothesis P.

Third, we point out that Hypothesis P is actually much stronger than what is needed in this paper. We could prove Theorem 1 assuming only the weaker statement that there exists a logarithmically slow function (see [14, §5]) such that for all large . For example, we could replace in Hypothesis P by for any fixed , or even by for any fixed . To keep the complexity arguments in this paper as simple as possible, we will only give the proof of Theorem 1 for the simplest form of Hypothesis P, as stated above.

It is interesting to ask for what bit size the new algorithm would be faster than the asymptotically inferior multiplication algorithms that are used in practice. The current reference implementation in the GMP library [10] uses Schönhage–Strassen’s algorithm [24]. On recent architectures, Pollard’s algorithm [23] tends to be very competitive as well [12]. Concerning our new algorithm, we stress that the present paper is optimised for simplicity rather than speed. An optimised version would essentially coincide with Pollard’s algorithm for sizes , where corresponds to the bit size of an integer hardware register, so the main recursive step in our algorithm would only be invoked for super-astronomical sizes. This does not withstand that some of the techniques that we developed for proving the new complexity bound may have practical applications for much smaller sizes. This is exactly what happened for the related problem of carryless integer multiplication: new techniques from [16] led to faster practical implementations [15, 20].

## 2. The algorithm

Define for . For the rest of the paper we assume that Hypothesis P holds, and hence we may fix an absolute constant such that

 P(q)⩽Cq(lgq)2

for all . (Numerical evidence suggests that achieves its maximum value at , implying that one may take .)

An admissible size is an integer of the form

 m=k(lgk)3

for some integer . For such , let denote the smallest prime of the form

 p=a⋅2m+1.

Hypothesis P implies that

 1⩽a

In the proof of Proposition 2 below, we will describe a recursive algorithm Transform that takes as input an admissible size , the corresponding prime , a power-of-two transform length such that

 (lgm)4

a primitive -th root of unity (such a primitive root exists as and ), and a polynomial . Its output is the DFT of with respect to

, that is, the vector

 ^F:=(F(1),F(ζ),…,F(ζL−1))∈(Fp)L.

For sufficiently large , Proposition 2 will reduce the computation of such a DFT to the computation of a large collection of similar DFTs over for an exponentially smaller prime

 p′=p0(m′)=a′⋅2m′+1

where

 m′=k′(lgk′)3∼2(lgm)3.

The main reduction consists of five steps, which may be summarised as follows:

1. Reduce the given ‘long’ transform of length  over to many ‘short’ transforms of exponentially smaller length over , via the Cooley–Tukey decomposition.

2. Reduce each short transform to a product in , i.e., a cyclic convolution of length , using Bluestein’s algorithm.

3. By splitting each coefficient in into exponentially smaller chunks of bit size , reduce each product from step (ii) to a product in .

4. Embed each product from step (iii) into , for a suitable prime that is exponentially smaller than ; more precisely, .

5. Reduce each product from step (iv) to a collection of forward and inverse DFTs of length  over , and recurse.

### 2.1. The main recursion

Let us now present the main reduction in more detail together with its complexity analysis. We denote the running time of Transform by . For there is always at least one integer in the interval (4), so we may define the normalisation

 T(m):=max(lgm)4

In our algorithm we must often perform auxiliary arithmetic operations on ‘small’ integers. These will always be handled via the Schönhage–Strassen algorithm [24] and Newton’s method [28, Ch. 9]; thus we may compute products, quotients and remainders of -bit integers in bit operations.

###### Proposition 2.

There exist absolute constants , and with the following property. For any admissible , there exists an admissible such that

 T(m)<(4+C1lglgm)T(m′)+C2. (5)
###### Proof.

Assume that we are given as input an admissible size , the corresponding prime , a transform length satisfying (4), a primitive -th root of unity , and a polynomial ; our goal is to compute . For the base case , we may compute  using any convenient algorithm. In what follows, we assume that and that is increased whenever necessary to accommodate statements that hold only for large .

Let us now detail the reductions (i)–(v) mentioned above.

Step (i) — reduce to short DFTs. In this step we reduce the given transform of length to a collection of short transforms of length

 S:=2(lgm)2.

By (4) we have , so . Let ; then is a primitive -th root of unity in .

Let , so that where . Applying the Cooley–Tukey method [5] to the factorisation , the given transform of length may be decomposed into  layers, each consisting of transforms of length  (with respect to ), followed by layers, each consisting of transforms of length . Between each of these layers, we must perform multiplications by ‘twiddle factors’ in , which are given by certain powers of . (For further details of the Cooley–Tukey decomposition, see for example [14, §2.3].)

Let us estimate the total cost of the twiddle factor multiplications. We have

, and by (4) also , so the total number of twiddle factor multiplications is . Using the Schönhage–Strassen algorithm, each multiplication in costs at most bit operations. As we have , so (3) implies that . Thus the cost of each multiplication in is bit operations, and the total cost of the twiddle factor multiplications is bit operations. This bound also covers the cost of the length  transforms (‘butterflies’), each of which requires one addition and one subtraction in .

In the Turing model, we must also account for the cost of rearranging data so that the inputs for each layer of short DFTs are stored sequentially on tape. Using a fast matrix transpose algorithm, the cost per layer is bit operations (see [14, §2.3] for further details), so bit operations altogether.

Let denote the number of bit operations required to perform transforms of length with respect to , i.e., the cost of one layer of short transforms. Since the number of layers of short transforms is , the above discussion shows that

 T(m,L)

Step (ii) — reduce to short convolutions. In this step we use Bluestein’s algorithm [2] to convert the short transforms into convolution problems. Suppose that at some layer of the main DFT we are given as input the short polynomials

 at(X)=S−1∑i=0at,iXi∈Fp[X]/(XS−1),t=1,…,L/S.

We wish to compute , the DFTs with respect to .

Let so that . For and , define

 ft,i:=ηi2at,i,gi:=η−i2,

and put

 ft(X):=S−1∑i=0ft,iXi,g(X):=S−1∑i=0giXi,

regarded as polynomials in . We may compute all of the , and compute all of the from the , using operations in . Then one finds (see for example [14, §2.5]) that , where

 ht:=ftg=S−1∑i=0ht,iXi∈Fp[X]/(XS−1). (7)

In other words, computing the short DFTs reduces to computing the products in , plus an additional operations in .

Let denote the cost of computing the products in . As noted in step (i), each multiplication in costs bit operations, so the above discussion shows that

 Tshort(m,L)

Substituting into (6) yields

 T(m,L)

Step (iii) — reduce to bivariate product over . In this step we transport the problem of computing the products in to the ring

 R:=Z[X,Y]/(XS−1,Yk+a),

by cutting up each coefficient in into chunks of bit size

 r:=m/k=(lgk)3.

We note for future reference the estimate

 r=(1+O(1)lglgm)(lgm)3; (9)

this follows from , because

 lgm=lgk+O(lglgk)=(1+O(lglgk)lgk)lgk=(1+O(1)lglgm)lgk.

Interpreting each and as an integer in the interval , and decomposing them in base , we write

 ft,i=k−1∑j=0ft,i,j2(k−1−j)r,gi=k−1∑j=0gi,j2(k−1−j)r, (10)

where and are integers in the interval

 0⩽ft,i,j,gi,j⩽2ra. (11)

(In fact, they are less than for ; the bound is only needed for the first term .) Then define polynomials

 Ft:=S−1∑i=0k−1∑j=0ft,i,jXiYj,G:=S−1∑i=0k−1∑j=0gi,jXiYj,

regarded as elements of , and let

 Ht:=FtG=S−1∑i=0k−1∑j=0ht,i,jXiYj

be the corresponding products in for .

We claim that knowledge of determines ; specifically, that

 ht,i=k−1∑j=0ht,i,j2(2k−2−j)r(modp) (12)

for each pair . To prove this, observe that by definition of multiplication in ,

 ht,i,j=∑i1+i2=imodS(∑j1+j2=jmodkj1+j2

On the other hand, from (7) and (10) we have

 ht,i =∑i1+i2=imodSft,i1gi2 =2(2k−2)r∑i1+i2=imodS k−1∑j1=0k−1∑j2=0ft,i1,j1gi2,j22−(j1+j2)r,

and since , we obtain

 ht,i=2(2k−2)r∑i1+i2=imodSk−1∑j=0==( ∑j1+j2=jmodkj1+j2

Comparing this expression with (13) yields (12). (The reason this works is that and are the images of and under the ring homomorphism from to that sends to , modulo a scaling factor of .)

Let us estimate the cost of using (12) to compute for a single pair , assuming that the are known. By (13) and (11) we have

 |ht,i,j|⩽(Sk)a(2ra)2=22rSka3.

Then, using (3) and the fact that , we obtain , so

 lg|ht,i,j|<2r+(lgm)2+7lgm+O(1).

Moreover, as , for large we certainly have , so

 r=(lgk)3⩽(lgm−1)3⩽(lgm)3−2(lgm)2,

and we deduce that

 lg|ht,i,j|<2(lgm)3. (14)

In particular, by (9), we see that has bit size . Now, to compute , we first evaluate the sum in (12) (in ) using a straightforward overlap-add procedure; this costs bit operations. Then we reduce the result modulo ; this costs a further bit operations (using the Schönhage–Strassen algorithm, as in step (i)). The total cost over all pairs is bit operations.

Let denote the cost of computing the products in . The above discussion shows that

 Cshort(m,L)

Substituting into (8) yields

 T(m,L)

Step (iv) — reduce to bivariate multiplication over . In this step we transfer the above multiplication problems from to the ring

 S:=Fp′[X,Y]/(XS−1,Yk+a),

where for a suitable choice of admissible size . The idea is to choose to be just large enough that computing the desired products modulo  determines their coefficients unambiguously in .

To achieve this, we will take where

 k′:=⌈β(lgβ−3lglgβ)3⌉,β:=2(lgm)3.

Note that is admissible (we may ensure that by taking sufficiently large). We claim that with this choice of we have

 β⩽m′<(1+O(1)lglgm)β, (16)

and consequently, taking (9) into account,

 m′=(2+O(1)lglgm)r. (17)

To establish the first inequality in (16), observe that since , we have

 log2k′⩾log2β−3log2lgβ⩾log2β−3lglgβ.

Hence , which implies that . For the second inequality, since we have

 m′=k′(lgk′)3 ⩽(β(lgβ−3lglgβ)3+1)(lgβ)3 =((lgβ)3(lgβ−3lglgβ)3+(lgβ)3β)β =(1+O((lglgβ)3)(lgβ)3)β=(1+O(1)lglgm)β.

Let us estimate the cost of computing . Hypothesis P ensures that , so we may locate by testing each candidate using a naive primality test (trial division) in bit operations. By (4) and (16) this amounts to

 2O(m′)=2O(β)=2O((lgm)3)=2O((lgL)3/4)=O(L) (18)

bit operations.

Now let be as in step (iii), and let be their images in ; that is,

 ut:=S−1∑i=0k−1∑j=0ut,i,jXiYj,v:=S−1∑i=0k−1∑j=0vi,jXiYj,

where and are the images in of and

. Computing these images amounts to zero-padding each coefficient up to

bits; by (17) we have , so the total cost of this step is bit operations.

Let for each . Clearly is the image in of . By (14) and (16), the coefficients of are completely determined by those of , as and . Moreover, this lifting can be carried out in linear time, so the cost of deducing from (for all ) is again bit operations.

Let denote the cost of computing the products in . The above discussion shows that

 Cbiv(m,L)

Substituting into (15) yields

 T(m,L)

Step (v) — reduce to DFTs over . Since and , there exists a primitive -th root of unity . We may find one such primitive root by a brute force search in bit operations (see (18)).

We will compute the products in by first performing DFTs with respect to , and then multiplying pointwise in . More precisely, for each let

 Ut,j:=S−1∑i=0ut,i,jXi,Vj:=S−1∑i=0vi,jXi,

regarded as polynomials in . We call Transform recursively to compute the transforms of each and with respect to , i.e., to compute the polynomials

 ut((ζ′)i,Y)=k−1∑j=0Ut,j((ζ′)i)Yj,v((ζ′)i,Y)=k−1∑j=0Vj((ζ′)i)Yj,

as elements of , for each and . The precondition corresponding to (4) for these recursive calls is

 (lgm′)4

This is certainly satisfied for large , as and by (16). There are transforms, so their total cost is bit operations.

We next compute the pointwise products

 wt((ζ′)i,Y)=ut((ζ′)i,Y)⋅v((ζ′)i,Y)

in for each and . Using the Schönhage–Strassen algorithm (both the integer variant and the polynomial variant [3]), the cost of each product is bit operations. Using the bounds , , and , this amounts to

 O((klgmlglgm)(m′lgm′lglgm′)) =O(mlgm(lglgm)2lglglgm) =O(m(lgm)2)

bit operations, or bit operations over all and .

Finally, we perform inverse DFTs with respect to to recover . It is well known that these inverse DFTs may be computed by the same algorithm as the forward DFT, with replaced by , followed by a division by . The divisions cost bit operations altogether (they are no more expensive than the pointwise multiplications), so the cost of this step is .

The procedure just described requires some data rearrangement, so that the DFTs and pointwise multiplication steps can access the necessary data sequentially. Using a fast matrix transpose algorithm, this costs bit operations altogether.

Combining all the contributions mentioned above, we obtain

 Ctiny(m,L)<(2LS+1)kT(m′,S)+O(Lm(lgm)2).

Substituting into (19) yields

 T(m,L)

Conclusion. This concludes the description of the algorithm; it remains to establish (5). Dividing the previous inequality by , and recalling that , we obtain

 T(m,L)mLlgL<(2+SL)km′m⋅T(m′,S)m′SlgS+O(1).

By definition of we have . Equation (17) implies that

 km′=(2+O(1)lglgm)kr=(2+O(1)lglgm)m,

and (4) yields . Therefore

 T(m,L)mLlgL<(4+O(1)lglgm)T(m′)+O(1).

Taking the maximum over all yields the desired bound (5). ∎

###### Corollary 3.

The corollary could be deduced from Proposition 2 by using [14, Prop. 8]. We give a simpler (but less general) argument here.

###### Proof.

Let , and be as in Proposition 2. We may assume, increasing if necessary, that

 (logm)1/2

for all . Define