Faster integer and polynomial multiplication using cyclotomic coefficient rings

by   David Harvey, et al.

We present an algorithm that computes the product of two n-bit integers in O(n log n (4√(2))^log^* n) bit operations. Previously, the best known bound was O(n log n 6^log^* n). We also prove that for a fixed prime p, polynomials in F_p[X] of degree n may be multiplied in O(n log n 4^log^* n) bit operations; the previous best bound was O(n log n 8^log^* n).



There are no comments yet.


page 1

page 2

page 3

page 4


Faster integer multiplication using plain vanilla FFT primes

Assuming a conjectural upper bound for the least prime in an arithmetic ...

A fast algorithm for solving linearly recurrent sequences

We present an algorithm which computes the D^th term of a sequence satis...

Faster integer multiplication using short lattice vectors

We prove that n-bit integers may be multiplied in O(n log n 4^log^* n)...

An exponent one-fifth algorithm for deterministic integer factorisation

Hittmeir recently presented a deterministic algorithm that provably comp...

The Polynomial Transform

We explore a new form of DFT, which we call the Polynomial Transform. It...

Ultrasparse Ultrasparsifiers and Faster Laplacian System Solvers

In this paper we provide an O(m (loglog n)^O(1)log(1/ϵ))-expected time a...

Toward an Optimal Quantum Algorithm for Polynomial Factorization over Finite Fields

We present a randomized quantum algorithm for polynomial factorization o...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

In this paper we present new complexity bounds for multiplying integers and polynomials over finite fields. Our focus is on theoretical bounds rather than practical algorithms. We work in the deterministic multitape Turing model [19]

, in which time complexity is defined by counting the number of steps, or equivalently, the number of ‘bit operations’, executed by a Turing machine with a fixed, finite number of tapes. The main results of the paper also hold in the Boolean circuit model.

The following notation is used throughout. For , we denote by the iterated logarithm, that is, the least non-negative integer such that , where (iterated times). For a positive integer , we define ; in particular, expressions like are defined and take positive values for all . We denote the -th cyclotomic polynomial by , and the Euler totient function by .

All absolute constants in this paper are in principle effectively computable. This includes the implied constants in all uses of notation.

1.1. Integer multiplication

Let denote the number of bit operations required to multiply two -bit integers. For over 35 years, the best known bound for was that achieved by the Schönhage–Strassen algorithm [23], namely


In 2007, Fürer described an asymptotically faster algorithm that achieves


for some unspecified constant [11, 12]. His algorithm reduces a multiplication of size to a large collection of multiplications of size exponentially smaller than ; these smaller multiplications are handled recursively. The term may be understood roughly as follows: the number of recursion levels is , and the constant

measures the amount of ‘data expansion’ that occurs at each level, due to phenomena such as zero-padding.

Immediately following Fürer’s work, De, Kurur, Saha and Saptharishi described a variant based on modular arithmetic [10], instead of the approximate complex arithmetic used by Fürer. Their algorithm also achieves (1.2), again for some unspecified .

The first explicit value for was given by Harvey, van der Hoeven and Lecerf, who described an algorithm that achieves (1.2) with [17]. Their algorithm borrows some important ideas from Fürer’s work, but also differs in several respects. In particular, their algorithm has no need for the ‘fast’ roots of unity that were the cornerstone of Fürer’s approach (and of the variant of [10]). The main presentation in [17] is based on approximate complex arithmetic, and the paper includes a sketch of a variant based on modular arithmetic that also achieves .

In a recent preprint, the first author announced that in the complex arithmetic case, the constant may be reduced to , by taking advantage of new methods for truncated integer multiplication [14]. This improvement does not seem to apply to the modular variants.

The first main result of this paper is the following further improvement.

Theorem 1.1.

There is an integer multiplication algorithm that achieves

In other words, (1.2) holds with . The proof is given in Section 5. The new algorithm works with modular arithmetic throughout; curiously, there is no obvious analogue based on approximate complex arithmetic.

There have been several proposals in the literature for algorithms that achieve under various unproved number-theoretic conjectures: see [17, §9], [15], and [8]. Whether can be reached unconditionally remains an important open question.

1.2. Polynomial multiplication over finite fields

For a prime , let denote the number of bit operations required to multiply two polynomials in of degree less than . The optimal choice of algorithm for this problem depends very much on the relative size of and .

If is not too large compared to , say , then a reasonable choice is Kronecker substitution: one lifts the polynomials to , packs the coefficients of each polynomial into a large integer (i.e., evaluates at for ), multiplies these large integers, unpacks the resulting coefficients to obtain the product in , and finally reduces the output modulo . This leads to the bound


where is any admissible constant in (1.2). To the authors’ knowledge, this is the best known asymptotic bound for in the region .

When is large compared to , the situation is starkly different. The Kronecker substitution method leads to poor results, due to coefficient growth in the lifted product: for example, when is fixed, Kronecker substitution yields

For many years, the best known bound in this regime was that achieved by the algebraic version of the Schönhage–Strassen algorithm [23, 22], namely


The first term arises from performing additions in , and the second term from multiplications in . (In fact, this sort of bound holds for polynomial multiplication over quite general rings [7].) For fixed , this is faster than the Kronecker substitution method by a factor of almost . The main reason for its superiority is that it exploits the modulo structure throughout the algorithm, whereas the Kronecker substitution method forgets this structure in the very first step.

After the appearance of Fürer’s algorithm, it was natural to ask whether a Fürer-type bound could be proved for , in the case that is large compared to . This question was answered in the affirmative by Harvey, van der Hoeven and Lecerf, who gave an algorithm that achieves

uniformly for all and [18]. This is a very elegant bound; however, written in this way, it obscures the fact that the constant plays two quite different roles in the complexity analysis. One source of the value is the constant arising from the integer multiplication algorithm mentioned above, but there is also a separate constant arising from the polynomial part of the algorithm. There is no particular reason to expect that , and it is somewhat of a coincidence that they have the same numerical value in [18].

To clarify the situation, we mention that one may derive a complexity bound for the algorithm of [18] under the assumption that one has available an integer multiplication algorithm achieving (1.2) for some , where possibly . Namely, one finds that


where (we omit the proof). The second main result of this paper, proved in Section 4, is the following improvement in the value of .

Theorem 1.2.

Let be any constant for which (1.2) holds (for example, by Theorem 1.1, one may take ). Then there is a polynomial multiplication algorithm that achieves


uniformly for all and all primes .

In other words, (1.5) holds with . In particular, for fixed , one can multiply polynomials in of degree in bit operations.

Theorem 1.2 may be generalised in various ways. We briefly mention a few possibilities along the lines of [18, §8] (no proofs will be given). First, we may obtain analogous bit complexity bounds for multiplication in and for , and in for arbitrary (see Theorems 8.1–8.3 in [18]). We may also obtain complexity bounds for polynomial multiplication in various algebraic complexity models. For example, we may construct a straight-line program that multiplies two polynomials in of degree less than , for any -algebra , using additions and scalar multiplications and nonscalar multiplications (compare with [18, Thm. 8.4]).

1.3. Overview of the new algorithms

To explain the new approach, let us first recall the idea behind the polynomial multiplication algorithm of [18].

Consider a polynomial multiplication problem in , where the degree is very large compared to . By splitting the inputs into chunks, we convert this to a bivariate multiplication problem in , for a suitable integer  and irreducible polynomial

. This bivariate product is handled by means of DFTs (discrete Fourier transforms) of length

over . The key innovation of [18] was to choose so that is divisible by many small primes, so many, in fact, that their product is comparable to , even though itself is exponentially smaller. This is possible thanks to a number-theoretic result of Adleman, Pomerance and Rumely [1], building on earlier work of Prachar [21]. Taking to be a product of many of these primes, we obtain , and hence contains a root of unity of order . As is highly composite, each DFT of length may be converted to a collection of much smaller DFTs via the Cooley–Tukey method. These in turn are converted into multiplication problems using Bluestein’s algorithm. These multiplications, corresponding to exponentially smaller values of , are handled recursively.

The recursion continues until becomes comparable to . The number of recursion levels during this phase is , and the constant represents the expansion factor at each recursion level. When becomes comparable to , the algorithm switches strategy to Kronecker substitution combined with ordinary integer multiplication. This phase contributes the term.

It was pointed out in [16, §8] that the value of can be improved to if one is willing to accept certain unproved number-theoretic conjectures, including Artin’s conjecture on primitive roots. More precisely, under these conjectures, one may find an irreducible of the form , where  is prime, so that is a direct summand of . This last ring is isomorphic to , and one may use this isomorphism to save a factor of two in zero-padding at each recursion level. These savings lead directly to the improved value for .

To prove Theorem 1.2, we will pursue a variant of this idea. We will take to be a cyclotomic polynomial for a judiciously chosen integer (not necessarily prime). Since , we may use the above isomorphism to realise the same economy in zero-padding as in the conjectural construction of [16, §8]. However, unlike [16], we do not require that be irreducible in . Thus is no longer in general a field, but a direct sum of fields. The situation is reminiscent of Fürer’s algorithm, in which the coefficient ring is not a field, but a direct sum of copies of . The key technical contribution of this paper is to show that we have enough control over the factorisation of in to ensure that contains suitable principal roots of unity. This approach avoids Artin’s conjecture and other number-theoretic difficulties, and enables us to reach unconditionally. The construction of is the subject of Section 3, and the main polynomial multiplication algorithm is presented in Section 4.

Let us now outline how we go about proving Theorem 1.1 (the integer case). The algorithm is heavily dependent on the polynomial multiplication algorithm just sketched. We take the basic problem to be multiplication in , for arbitrary positive . We choose a collection of small primes , each having around bits, and whose product  has bits. By cutting the input integers into many small chunks, we convert to a multiplication in for a suitable . One technical headache is that  is not necessarily divisible by ; following [17], we deal with this by adapting an idea of Crandall and Fagin [9]. Next, by the Chinese remainder theorem, we reduce to multiplying in for each  separately. This is reminisicent of Pollard’s algorithm [20], but instead of using three primes, here the number of primes grows with . At this stage, the coefficient size is doubly exponentially smaller than . We perform these multiplications in by applying two recursion levels of the polynomial multiplication algorithm of Theorem 1.2. This reduces the problem to a collection of multiplication problems in , each doubly exponentially smaller than the original problem. Using Kronecker substitution, these are converted back to multiplications in , where is doubly exponentially smaller than , and the algorithm is applied recursively.

In effect, each recursive call in the new integer multiplication algorithm corresponds to two recursion levels of the existing Fürer-type algorithms. The speedup relative to [17] may be understood as follows. In the algorithm of [17], at each recursion level we incur a factor of two in overhead due to the zero-padding that occurs when we split the inputs into small chunks. In the new algorithm, the passage from to manages the same exponential size reduction without any zero-padding. This roughly corresponds to saving a factor of two at every second recursion level of the algorithm of [17], and explains the factor of overall speedup.

2. Preliminaries

2.1. Logarithmically slow functions

Let , and let be a smooth increasing function. We recall from [17, §5] that is said to be logarithmically slow if there exists an integer such that

as . For example, the functions , , , and are logarithmically slow, with respectively.

We will always assume that is chosen large enough to ensure that for all . According to [17, Lemma 2], this is possible for any logarithmically slow function, and it implies that the iterator is well-defined on . It is shown in [17, Lemma 3] that this iterator satisfies


as . In other words, logarithmically slow functions are more or less indistinguishable from , as far as iterators are concerned.

As in [17] and [18], we will use logarithmically slow functions to measure size reduction in multiplication algorithms. The typical situation is that we have a function measuring the (normalised) cost of a certain multiplication algorithm for inputs of size ; we reduce the problem to a collection of problems of size for some , leading to a bound for in terms of the various

. Applying the reduction recursively, we wish to convert these bounds into an explicit asymptotic estimate for

. This is achieved via the following ‘master theorem’.

Proposition 2.1.

Let , , and let and be integers. Let , and let be a logarithmically slow function such that for all . Assume that is chosen so that is defined for all . Then there exists a positive constant (depending on , , , , , and ) with the following property.

Let and . Let , and let be any function satisfying the following recurrence. First, for all , . Second, for all , , there exist with , and weights with , such that

For all , , we then have


The special case , is exactly [17, Prop. 8]. We indicate briefly how the proof of [17, Prop. 8] must be modified to obtain this more general statement.

The first two paragraphs of the proof of [17, Prop. 8] may be read verbatim. In the third paragraph, the inductive statement is changed to

where . The inductive step is modified slightly: for we use the fact that , and for the fact that . With these changes, the proof given in [17] goes through without difficulty. ∎

2.2. Discrete Fourier transforms

Let and let be a commutative ring in which  is invertible. A principal -th root of unity is an element such that and such that for . If is a divisor of , then is easily seen to be a principal -th root of unity.

Fix a principal -th root of unity . The discrete Fourier transform (DFT) of the sequence with respect to is the sequence defined by . Equivalently, where .

The inverse DFT recovers from . Computationally it corresponds to a DFT with respect to , followed by a division by , because

DFTs may be used to implement cyclic convolutions. Suppose that we wish to compute where . We first perform DFTs to compute and for . We then compute for each , and finally perform an inverse DFT to recover .

This strategy may be generalised to handle a multidimensional cyclic convolution, that is, to compute for

For this, we require that each be invertible in , and that contain a principal -th root of unity for each . Let . We first perform multidimensional DFTs to evaluate and at the points . We then multiply pointwise, and finally recover  via a multidimensional inverse DFT.

Each multidimensional DFT may be reduced to a collection of one-dimensional DFTs as follows. We first compute for each ; this involves DFTs of length . We then recursively evaluate each of these polynomials at the points . Altogether, this strategy involves computing DFTs of length for each .

Finally, we briefly recall Bluestein’s method [5] for reducing a (one-dimensional) DFT to a convolution problem (see also [17, §2.5]). Let

be odd and let

be a principal -th root of unity. Set , so that and . Then computing the DFT of a given sequence with respect to  reduces to computing the product of the polynomials

in , plus auxiliary multiplications in . Notice that is fixed and does not depend on the input sequence.

2.3. The Crandall–Fagin algorithm

Consider the problem of computing a ‘cyclic’ integer product of length , that is, a product where . If and are positive integers such that and , then we may reduce the given problem to multiplication in , by cutting up the integers into chunks of bits. In this section we briefly recall a variant [17, §9.2] of an algorithm due to Crandall and Fagin [9] that achieves the same reduction without the assumption that .

Assume that and , and that we have available some with . (This  plays the same role as the real -th root of  in the original Crandall–Fagin algorithm.) Set and . Observe that or for each . Decompose the inputs as and where (i.e., a decomposition with respect to a ‘variable base’). Set and , regarded as polynomials in , and compute the product . Then one finds (see [17, §9.2]) that the product may be recovered by the formula , where the are integers in defined by .

To summarise, the problem of computing reduces to computing a product in , together with auxiliary multiplications in , and bit operations to compute the and to handle the final overlap-add phase (again, see [17, §9.2] for details).

2.4. Data layout

In this section we discuss several issues relating to the layout of data on the Turing machine tapes.

Integers will always be stored in the standard binary representation. If is a positive integer, then elements of will always be stored as residues in the range , occupying bits of storage.

If is a ring and is a polynomial of degree , then an element of will always be represented as a sequence of  coefficients in the standard monomial order. This convention is applied recursively, so for rings of the type , the coefficient of is stored first, as an element of , followed by the coefficient of , and so on.

A multidimensional array of size , whose entries occupy bits each, will be stored as a linear array of bits. The entries are ordered lexicographically in the order . In particular, an element of is represented as an array of elements of . We will generally prefer the more compact notation .

There are many instances where an array must be transposed so that its entries can be accessed efficiently either ‘by columns’ or ‘by rows’. Using the algorithm of [6, Lemma 18], such a transposition may be achieved in bit operations, where is the bit size of each entry. (The idea of the algorithm is to split the array in half along the short dimension, and transpose each half recursively.)

One important application is the following result, which estimates the data rearrangement cost associated to the the Agarwal–Cooley method [2] for converting between one-dimensional and multidimensional convolution problems (this is closely related to the Good–Thomas DFT algorithm [13, 26]).

Lemma 2.2.

Let be relatively prime, and let be a ring whose elements are represented using bits. There exists an isomorphism

that may be evaluated in either direction in bit operations.


Let , and let

denote the homomorphism that maps to , and acts as the identity on . Suppose that we wish to compute for some input polynomial . Interpreting the list as an array, the -th entry corresponds to . After transposing the array, which costs bit operations, we have an array, whose -th entry is . Now for each , cyclically permute the -th row by slots; altogether this uses only bit operations. The result is an array whose -th entry is , which is exactly the coefficient of in . The inverse map may be computed by reversing this procedure. ∎

Corollary 2.3.

Let be pairwise relatively prime, let , and let be a ring whose elements are represented using bits. There exists an isomorphism

that may be evaluated in either direction in bit operations.


Using Lemma 2.2, we may construct a sequence of isomorphisms

the -th of which may be computed in bit operations. The overall cost is bit operations. ∎

3. Cyclotomic coefficient rings

The aim of this section is to construct certain coefficient rings that play a central role in the multiplication algorithms described later. The basic idea is as follows. Suppose that we want to multiply two polynomials in , and that the degree of the product is known to be at most . If is an integer with , then by appropriate zero-padding, we may embed the problem in . Furthermore, if we have some factorisation , where and are relatively prime, then there is an isomorphism

and the latter ring is closely related to

(recall that denotes the -th cyclotomic polynomial). In particular, computing the product in recovers ‘most’ of the information about the product in .

In this section we show how to choose , and with the following properties:

  1. is not much larger than , so that not too much space is ‘wasted’ in the initial zero-padding step;

  2. () is not much smaller than , so that we do not lose much information by working modulo instead of modulo (this missing information must be recovered by other means);

  3. the coefficient ring contains a principal -th root of unity, so that we can multiply in efficiently by means of DFTs over ;

  4. is a product of many integers that are exponentially smaller than , so that the DFTs of length may be reduced to many small DFTs; and

  5. is itself exponentially smaller than .

The last two items ensure that the small DFTs can be converted to multiplication problems of degree exponentially smaller than , to allow the recursion to proceed.

Definition 3.1.

An admissible tuple is a sequence of distinct primes () satisfying the following conditions. First,


where . Second, is squarefree for , and


(Note that need not be squarefree, and does not participate in (3.2).)

An admissible length is a positive integer of the form where is an admissible tuple.

If is an admissible length, we treat and as auxiliary data attached to . For example, if an algorithm takes  as input, we implicitly assume that this auxiliary data is also supplied as part of the input.

Example 3.2.

For , there is a nearby admissible length



Definition 3.3.

Let be a prime. An admissible length is called -admissible if and (i.e., is distinct from ).

The following result explains how to choose a -admissible length close to any prescribed target.

Proposition 3.4.

There is an absolute constant with the following property. Given as input a prime and an integer , in bit operations we may compute a -admissible length in the interval


The key ingredient in the proof is the following number-theoretic result of Adleman, Pomerance and Rumely.

Lemma 3.5 ([1, Prop. 10]).

There is an absolute constant with the following property. For all , there exists a positive squarefree integer such that

Proof of Proposition 3.4.

Let , and for define to be the number of primes in the interval such that and . We claim that, provided is large enough, there exists some squarefree such that . To see this, apply Lemma 3.5 with ; for large we then have

so Lemma 3.5 implies that there exists a positive squarefree integer for which

and hence

We may locate one such by means of the following algorithm (adapted from the proof of [18, Lemma 4.5]). First use a sieve to enumerate the primes in the interval , and to determine which are squarefree, in bit operations. Now initialise an array of integers for . For each , scan through the array, incrementing those  for which is squarefree and divisible by , and stop as soon as one of the  reaches . We need only allocate bits per array entry, so each pass through the array costs bit operations. The number of passes is , so the total cost of finding a suitable is bit operations. Within the same time bound, we may also easily recover a list of primes for which .

Next, compute the partial products , , …, , and determine the smallest integer for which . Such an certainly exists, as . Since each occupies bits, this can all be done in bit operations. Also, as

and , we find that

for large .

Let be the least prime that exceeds and that is distinct from . According to [4], the interval contains at least one prime for all sufficiently large ; therefore

for sufficiently large. We may find in bit operations, by using trial division to test successive integers for primality.

Set . Then (3.3) holds, and certainly and . Let us check that is admissible, provided is large enough. For we have

and also

this establishes (3.1). Also, as