Let denote the number of bit operations required to multiply two
-bit integers, where “bit operations” means the number of steps on a deterministic Turing machine with a fixed, finite number of tapes (our results also hold in the Boolean circuit model). Let denote the iterated natural logarithm, i.e., , where (iterated times). The main result of this paper is an algorithm achieving the following bound.
We have .
The first complexity bound for of the form was established by Fürer [8, 9], for an unspecified constant . His algorithm reduces a multiplication of size to many multiplications of size exponentially smaller than , which are then handled recursively. The number of recursion levels is , and the constant measures the “expansion factor” at each recursion level.
The first explicit value for , namely , was given by Harvey, van der Hoeven and Lecerf . Their method is somewhat different to Fürer, but still carries out an exponential size reduction at each recursion level. One may think of the constant as being built up of three factors of , each coming from a different source.
The first factor of
arises from the need to perform both forward and inverse DFTs (discrete Fourier transforms) at each recursion level. This is a feature common to all of the post-Fürer algorithms, suggesting that significantly new ideas will be needed to do any better than.
The second factor of arises from coefficient growth: a product of polynomials with -bit integer coefficients has coefficients with at least bits. This factor of also seems difficult to completely eliminate, although Harvey and van der Hoeven have recently made some progress : they achieve by arranging that, in effect, the coefficient growth only occurs at every second recursion level. This was the best known unconditional value of prior to the present paper.
The final factor of occurs because the algorithm works over : when multiplying complex coefficients with say significant bits, the algorithm first computes a full -bit product, and then truncates to bits. More precisely, after splitting the -bit inputs into exponentially smaller chunks, and encoding them into polynomials of degree , the algorithm must compute the full product of degree , even though essentially only coefficients are needed to resolve significant bits of the product. Again, this factor of has been the subject of a recent attack: Harvey has shown  that it is possible to work modulo a polynomial of degree only , at the expense of increasing the working precision by a factor of . This leads to an integer multiplication algorithm achieving .
Another way of attacking this last factor of is to replace the coefficient ring by a finite ring for a suitable integer . By choosing with some special structure, it may become possible to convert a multiplication modulo directly into a polynomial multiplication modulo some polynomial of degree , rather than . Three algorithms along these lines have been proposed.
First, Harvey, van der Hoeven and Lecerf suggested using Mersenne primes, i.e., primes of the form , where is itself prime [13, §9]. They convert multiplication in to multiplication in , where is a power of two. Because is not divisible by , the process of splitting an element of into chunks is somewhat involved, and depends on a variant of the Crandall–Fagin algorithm .
Covanov and Thomé  later proposed using generalised Fermat primes, i.e., primes of the form , where is a power of two and is a small even integer. Here, multiplication in is converted to multiplication in . The splitting procedure consists of rewriting an element of in base , via fast radix-conversion algorithms.
Finally, Harvey and van der Hoeven  proposed using FFT primes, i.e., primes of the form , where is small. They reduce multiplication in to multiplication in via a straightforward splitting of the integers into chunks, where is a power of two. Here the splitting process is trivial, as may be chosen to be divisible by .
These three algorithms all achieve , subject to plausible but unproved conjectures on the distribution of the relevant primes. Unfortunately, in all three cases, it is not even known that there are infinitely many primes of the required form, let alone that there exist a sufficiently high density of them to satisfy the requirements of the algorithm.
The main technical novelty of the present paper is a splitting procedure that works for an almost arbitrary modulus . The core idea is to introduce an alternative representation for elements of : we represent them as expressions , where is some fixed -th root of unity in , and where the are small integers, of size roughly . Essentially the only restriction on is that must contain an appropriate -th root of unity. We will see that Linnik’s theorem is strong enough to construct suitable such moduli .
In Section 2 we show that the cost of multiplication in this representation is only a constant factor worse than for the usual representation. The key ingredient is Minkowski’s theorem on lattice vectors in symmetric convex sets. We also give algorithms for converting between this representation and the standard representation. The conversions are not as fast as one might hope — in particular, we do not know how to carry them out in quasilinear time — but surprisingly this turns out not to affect the overall complexity, because in the main multiplication algorithm we perform the conversions only infrequently.
Then in Sections 3 and 4 we prove Theorem 1.1, using an algorithm that is structurally very similar to . We make no attempt to minimise the implied big- constant in Theorem 1.1; our goal is to give the simplest possible proof of the asymptotic bound, without any regard for questions of practicality.
An interesting question is whether it is possible to combine the techniques of the present paper with those of  to obtain an algorithm achieving . Our attempts in this direction have so far been unsuccessful. One might also ask if the techniques of this paper can be transferred to the case of multiplication of polynomials of high degree in . However, this is not so interesting, because an unconditional proof of the bound corresponding to in the polynomial case is already known .
Throughout the paper we use the following notation. We write for , and for convenience put . We define , where is some constant so that the Schönhage–Strassen algorithm multiplies -bit integers in at most bit operations . This function satisfies for any , and also for fixed . An -bit integer may be divided by -bit integer, producing quotient and remainder, in time [20, Ch. 9]. We may transpose an array of objects of bit size in bit operations [4, Appendix]. Finally, we occasionally use Xylouris’s refinement of Linnik’s theorem , which states that for any relatively prime positive integers and , the least prime in the arithmetic progression satisfies .
Throughout this section, fix an integer and a power of two such that
and assume we are given some such that .
For a polynomial , define . This norm satisfies for any .
Let . A -representation for is a polynomial such that and .
The coefficients in a -representation must not exceed . Two examples of -representations for are
By (2.1), the number of bits required to store is at most
so a -representation incurs very little overhead in space compared to the standard representation by an integer in the interval .
Our main tool for working with -representations is the reduction algorithm stated in Lemma 2.9 below. Given a polynomial , whose coefficients are up to about twice as large as allowed in a -representation, the reduction algorithm computes a -representation for (up to a certain scaling factor, discussed further below). The basic idea of the algorithm is to precompute a nonzero polynomial such that , and then to subtract an appropriate multiple of from to make the coefficients small.
After developing the reduction algorithm, we are able to give algorithms for basic arithmetic on elements of given in -representation (Proposition 2.15), a more general reduction algorithm for inputs of arbitrary size (Proposition 2.17), and algorithms for converting between standard and -representation (Proposition 2.18 and Proposition 2.20).
We begin with two results that generate certain precomputed data necessary for the main reduction step.
In bit operations, we may compute a nonzero polynomial such that and .
We first establish existence of a suitable . Let denote a lift of to , and consider the lattice spanned by the rows of the integer matrix
Every vector satisfies the equation . The volume of the fundamental domain of is . The volume of the closed convex symmetric set is , so by Minkowski’s theorem (see for example [14, Ch. V, Thm. 3]), there exists a nonzero vector in . The corresponding polynomial then has the desired properties.
To actually compute , we simply perform a brute-force search. By (2.1) there are at most candidates to test. Enumerating them in lexicographical order, we can easily evaluate in an average of bit operations per candidate. ∎
Continuing Example 2.2, the coefficients of must not exceed . A suitable polynomial is given by
The computation of is closely related to the problem of finding an element of small norm in the ideal of the ring generated by and , where denotes a primitive -th root of unity.
The poor exponential-time complexity of Lemma 2.3
can probably be improved, by taking advantage of more sophisticated lattice reduction or shortest vector algorithms, but we were not easily able to extract a suitable result from the literature. For example, LLL is not guaranteed to produce a short enough vector, and the Micciancio–Voulgaris exact shortest vector algorithm  solves the problem for the Euclidean norm rather than the uniform norm. In any case, this has no effect on our main result.
Assume that has been precomputed as in Lemma 2.3. Let be the smallest prime exceeding such that and such that is invertible in . Then , and in bit operations we may compute and a polynomial such that and .
Let be the resultant of (regarded as a polynomial in ) and . The primes dividing are exactly the primes for which fails to be invertible in . Therefore our goal is to find a prime such that .
Since is a power of two, is a cyclotomic polynomial and hence irreducible in . Thus and have no common factor, and so . Also, we have where runs over the complex roots of . These roots all lie on the unit circle, so , and hence by (2.1) we obtain .
On the other hand, the prime number theorem (in the form , see for example [1, §4.3]) implies that there exists an absolute constant such that for any we have (sum taken over primes). Taking , by (2.1) again we get
In particular, there must be at least one prime in the interval that does not divide .
To find the smallest such , we first make a list of all primes up to in bit operations. Then for each prime between and , we check whether divides in bit operations, and attempt to invert in in bit operations [20, Ch. 11]. ∎
Continuing Example 2.2, we have and
Now we come to the main step of the reduction algorithm, which is inspired by Montgomery’s method for modular reduction .
We first compute the “quotient” , normalised so that . This is done by means of Kronecker substitution [20, Ch. 8], i.e., we pack the polynomials and into integers, multiply the integers, unpack the result, and reduce the result modulo and modulo . The packed integers have at most bits, where the term accounts for coefficient growth in . By (2.1) and Lemma 2.7, this simplifies to bits, so the integer multiplication step costs bit operations. This bound also covers the cost of the reductions modulo .
Next we compute the product , again using Kronecker substitution, at a cost of bit operations. Since and , we have .
By construction of we have . In particular, all the coefficients of are divisible by . The last step is to compute the “remainder” ; again, this step costs bit operations. Since , we have
Finally, since , and all arithmetic throughout the algorithm has been performed modulo , we see that . ∎
Using the above reduction algorithm, we may give preliminary addition and multiplication algorithms for elements of in -representation.
Let the -representations be given by . We may compute in using Kronecker substitution in bit operations, and in bit operations. Note that , and , so we may apply Lemma 2.9 to obtain the desired -representations. ∎
Continuing Example 2.2, we walk through an example of computing a product of elements in -representation. Let
Suppose we are given as input the -representations
We first compute the product of and modulo :
We multiply by and reduce modulo to obtain the quotient
Then the remainder
is a -representation for .
We may easily compute the totient function in bit operations, by first factoring . Since , we have . Repeatedly using the identity , we may compute -representations for by successively applying Lemma 2.10. ∎
Continuing Example 2.2, we may take
Henceforth we write for the tuple of precomputed data generated by Lemmas 2.3, 2.7, and 2.12. Given , and as input, the above results show that we may compute in bit operations. With these precomputations out of the way, we may state complexity bounds for the main operations on -representations.
Assume that has been precomputed. Given as input -representations for , we may compute -representations for and in bit operations.
We suspect that the complexity bound for can be improved to , but we do not currently know how to achieve this. This question seems closely related to Remark 2.22 below.
Assume that has been precomputed. Given as input a polynomial (with no restriction on ), we may compute a -representation for in time .
Let and , so that
We may therefore decompose the coefficients of into chunks of bits, i.e., we may compute polynomials such that and . (This step implicitly requires an array transposition of cost .) Now we use Proposition 2.15 repeatedly to compute a -representation for via Horner’s rule, i.e., first we compute a -representation for , then for , and so on. ∎
Assume that has been precomputed. Given as input an element in standard representation, we may compute a -representation for in bit operations.
Simply apply Proposition 2.17 to the constant polynomial , noting that . ∎
A corollary of Proposition 2.18 is that every admits a -representation. It would be interesting to have a direct proof of this fact that does not rely on the reduction algorithm. A related question is whether it is possible to tighten the bound in the definition of -representation from to , or even . We do not know whether such a representation exists for all .
Given as input an element in -representation, we may compute the standard representation for in bit operations.
Let be the input polynomial. The problem amounts to evaluating in . Again we may simply use Horner’s rule. ∎
In the reduction algorithm, the reader may wonder why we go to the trouble of introducing the auxiliary prime . Why not simply precompute an approximation to a real inverse for , i.e., the inverse in , and use this to clear out the high-order bits of each coefficient of the dividend? In other words, why not replace the Montgomery-style division with the more natural Barrett-style division ?
The reason is that we cannot prove tight enough bounds on the size of the coefficients of this inverse: it is conceivable that might accidentally take on a very small value near one of the complex roots of , or equivalently, that the resultant in the proof of Lemma 2.7 might be unusually small. For the same reason, we cannot use a more traditional -adic Montgomery inverse to clear out the low-order bits of the dividend, because again may take a -adically small value near one of the -adic roots of , or equivalently, the resultant might be divisible by an unusually large power of .
3. Integer multiplication: the recursive step
In this section we present a recursive routine Transform with the following interface. It takes as input a (sufficiently large) power-of-two transform length , a prime , a prime power such that
a principal -th root of unity (i.e., an -th root of unity whose reduction modulo is a primitive -th root of unity in the field ), certain precomputed data depending on , and (see below), and a polynomial . Its output is the DFT of with respect to , that is, the vector
The coefficients of both and are given in standard representation.
The precomputed data consists of the tuple defined in Section 2, where and are defined as follows.
First, (3.1) implies that for large , so we may take to be the unique power of two lying in the interval
We remark that the role of the parameter is to give us enough control over the bit size of , to compensate for the fact that Linnik’s theorem does not give us sufficiently fine control over the bit size of (see Lemma 3.5).
Our implementation of Transform uses one of two algorithms, depending on the size of . If is below some threshold, say , then it uses any convenient base-case algorithm. Above this threshold, it reduces the given DFT problem to a collection of exponentially smaller DFTs of the same type, via a series of reductions that may be summarised as follows.
Use the conversion algorithms from Section 2 to reduce to a transform over where the input and output coefficients are given in -representation. (During steps (ii) and (iii) below, all elements of are stored and manipulated entirely in -representation.)
Reduce the “long” transform of length over to many “short” transforms of exponentially small length over , via the Cooley–Tukey decomposition.
Reduce each short transform from step (ii) to a product in , i.e., a cyclic convolution of length , using Bluestein’s algorithm.
Use the definition of -representation to reinterpret each product from (iii) as a product in , where the coefficients in are exponentially smaller than the original coefficients in .
Embed each product from (iv) into for a suitable prime power that is exponentially smaller than , and large enough to resolve the coefficients of the products over .
Reduce each product from (v) to a collection of forward and inverse DFTs of length over , and recurse.
The structure of this algorithm is very similar to that of . The main difference is that it is not necessary to explicitly split the coefficients into chunks in step (iv); this happens automatically as a consequence of storing the coefficients in -representation. In effect, the splitting (and reassembling) work has been shunted into the conversions in step (i).
We now consider each of the above steps in more detail. We write for the running time of Transform. We always assume that is increased whenever necessary to accommodate statements that hold only for large .
Step (i) — convert between representations. Let denote the time required to compute a DFT of length over with respect to , assuming that the coefficients of the input and the output are given in -representation, and assuming that is known.
We have .
Henceforth all elements of are assumed to be stored in -representation, and we will always use Proposition 2.15 to perform arithmetic operations on such elements in bit operations.
Step (ii) — reduce to short DFTs. Let . Given as input polynomials (presented sequentially on tape), let denote the time required to compute the transforms with respect to the principal -th root of unity . (Here and below, we continue to assume that is known.).
We have .
Let , so that where . Applying the Cooley–Tukey method  to the factorisation , the given transform of length may be decomposed into layers, each consisting of transforms of length (with respect to ), followed by layers, each consisting of transforms of length . Between each of these layers, we must perform multiplications by “twiddle factors” in , which are given by certain powers of . (For further details of the Cooley–Tukey decomposition, see for example [13, §2.3].)
The total cost of the twiddle factor multiplications, including the cost of computing the twiddle factors themselves, is
This bound also covers the cost of the length transforms (‘butterflies’), each of which requires one addition and one subtraction in .
In the Turing model, we must also account for the cost of rearranging data so that the inputs for each layer of short DFTs are stored sequentially on tape. The cost per layer is bit operations, so altogether (see [13, §2.3] for further details). ∎
Step (iii) — reduce to short convolutions. Given polynomials as input, let denote the time required to compute the products .
We have .
We use Bluestein’s method , which reduces the the problem of computing the DFT of to the problem of computing the product of certain polynomials , plus auxiliary multiplications in (for further details see [13, §2.5]). Here depends on and , but depends only on . The total cost of the auxiliary multiplications is
Step (iv) — reduce to bivariate products over . Given as input polynomials , all whose of coefficients are bounded in absolute value by , let denote the cost of computing the products .
We have .
We are given as input polynomials . Since their coefficients are given in -representation, we may immediately reinterpret them as polynomials , with coefficients bounded by . By definition of -representation, we have , and similarly for the .
After computing the products for , suppose that
Then we have for each . Therefore, to compute the desired products with coefficients in -representation, it suffices to apply Proposition 2.17 to each , to compute -representations for all of the .
Step (v) — Reduce to bivariate products over . Let be the smallest prime such that ; by Linnik’s theorem we have . Put where
We have the following bounds for .
Let be as in the proof of Lemma 3.4, for and . Then and
In what follows, we frequently use the fact that