Faster integer multiplication using short lattice vectors

02/22/2018
by David Harvey and Joris van der Hoeven
UNSW

We prove that n-bit integers may be multiplied in O(n log n · 4^(log* n)) bit operations. This complexity bound had been achieved previously by several authors, assuming various unproved number-theoretic hypotheses. Our proof is unconditional, and depends in an essential way on Minkowski's theorem concerning lattice vectors in symmetric convex sets.


1. Introduction

Let M(n) denote the number of bit operations required to multiply two n-bit integers, where "bit operations" means the number of steps on a deterministic Turing machine with a fixed, finite number of tapes [18] (our results also hold in the Boolean circuit model). Let log* x denote the iterated natural logarithm, i.e., log* x := min{j ≥ 0 : log∘j(x) ≤ 1}, where log∘j(x) := log ··· log x (iterated j times). The main result of this paper is an algorithm achieving the following bound.
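Since log* governs the recursion depth of the algorithms discussed below, it may help to see how slowly it grows. The following Python snippet (an illustration of the definition, not part of the paper) computes log* x directly:

```python
import math

def log_star(x: float) -> int:
    """Iterated natural logarithm: how many times log must be applied
    before the value drops to 1 or below."""
    j = 0
    while x > 1:
        x = math.log(x)
        j += 1
    return j

# log* grows extremely slowly; e.g. log_star(2.0**64) == 4.
print([log_star(x) for x in (1.0, 2.0, 16.0, 2.0**64)])   # [0, 1, 3, 4]
```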

Theorem 1.1.

We have M(n) = O(n log n · 4^(log* n)).

The first complexity bound for M(n) of the form O(n log n · K^(log* n)) was established by Fürer [8, 9], for an unspecified constant K > 1. His algorithm reduces a multiplication of size n to many multiplications of size exponentially smaller than n, which are then handled recursively. The number of recursion levels is log* n + O(1), and the constant K measures the "expansion factor" at each recursion level.

The first explicit value for K, namely K = 8, was given by Harvey, van der Hoeven and Lecerf [13]. Their method is somewhat different from Fürer's, but still carries out an exponential size reduction at each recursion level. One may think of the constant K = 8 as being built up of three factors of 2, each coming from a different source.

The first factor of 2 arises from the need to perform both forward and inverse DFTs (discrete Fourier transforms) at each recursion level. This is a feature common to all of the post-Fürer algorithms, suggesting that significantly new ideas will be needed to do any better than K = 2.

The second factor of 2 arises from coefficient growth: a product of polynomials with k-bit integer coefficients has coefficients with at least 2k bits. This factor of 2 also seems difficult to eliminate completely, although Harvey and van der Hoeven have recently made some progress [12]: they achieve K = 4√2 ≈ 5.66 by arranging that, in effect, the coefficient growth only occurs at every second recursion level. This was the best known unconditional value of K prior to the present paper.

The final factor of 2 occurs because the algorithm works over C: when multiplying complex coefficients with, say, β significant bits, the algorithm first computes a full 2β-bit product, and then truncates to β bits. More precisely, after splitting the n-bit inputs into exponentially smaller chunks and encoding them into polynomials of degree m, the algorithm must compute the full product of degree 2m, even though essentially only m coefficients are needed to resolve n significant bits of the product. Again, this factor of 2 has been the subject of a recent attack: Harvey has shown [10] that it is possible to work modulo a polynomial of degree only m, at the expense of increasing the working precision by a factor of 3/2. This leads to an integer multiplication algorithm achieving K = 6.

Another way of attacking this last factor of 2 is to replace the coefficient ring C by a finite ring Z/qZ for a suitable integer q. By choosing q with some special structure, it may become possible to convert a multiplication modulo q directly into a polynomial multiplication modulo some polynomial of degree m, rather than 2m. Three algorithms along these lines have been proposed.

First, Harvey, van der Hoeven and Lecerf suggested using Mersenne primes, i.e., primes of the form q = 2^k − 1, where k is itself prime [13, §9]. They convert multiplication in Z/qZ to multiplication in Z[y]/(y^m − 1), where m is a power of two. Because k is not divisible by m, the process of splitting an element of Z/qZ into m chunks is somewhat involved, and depends on a variant of the Crandall–Fagin algorithm [7].

Covanov and Thomé [6] later proposed using generalised Fermat primes, i.e., primes of the form q = r^m + 1, where m is a power of two and r is a small even integer. Here, multiplication in Z/qZ is converted to multiplication in Z[y]/(y^m + 1). The splitting procedure consists of rewriting an element of Z/qZ in base r, via fast radix-conversion algorithms.

Finally, Harvey and van der Hoeven [11] proposed using FFT primes, i.e., primes of the form q = a·2^k + 1, where a is small. They reduce multiplication in Z/qZ to multiplication in Z[y]/(a·y^m + 1) via a straightforward splitting of the integers into m chunks, where m is a power of two. Here the splitting process is trivial, as k may be chosen to be divisible by m.

These three algorithms all achieve K = 4, subject to plausible but unproved conjectures on the distribution of the relevant primes. Unfortunately, in all three cases, it is not even known that there are infinitely many primes of the required form, let alone that there exists a sufficiently high density of them to satisfy the requirements of the algorithm.

The main technical novelty of the present paper is a splitting procedure that works for an almost arbitrary modulus q. The core idea is to introduce an alternative representation for elements of Z/qZ: we represent them as expressions a_0 + a_1 θ + ··· + a_(m−1) θ^(m−1), where θ is some fixed 2m-th root of unity in Z/qZ, and where the a_i are small integers, of size roughly q^(1/m). Essentially the only restriction on q is that Z/qZ must contain an appropriate 2m-th root of unity. We will see that Linnik's theorem is strong enough to construct suitable such moduli q.
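To make this concrete, here is a minimal Python sketch of the representation, with toy parameters of our own choosing (q = 97, m = 4, θ = 64; the paper's parameters are asymptotic, and the precise bounds on the a_i are given in Section 2):

```python
# Toy illustration of the alternative representation (parameters ours).
q, m = 97, 4           # modulus and power-of-two length
theta = 64             # 64^4 = -1 (mod 97), a primitive 2m-th root of unity

assert pow(theta, m, q) == q - 1

def evaluate(coeffs):
    """Map a coefficient list (a_0, ..., a_{m-1}) to the element
    a_0 + a_1*theta + ... + a_{m-1}*theta^(m-1) of Z/qZ."""
    u = 0
    for a in reversed(coeffs):   # Horner's rule
        u = (u * theta + a) % q
    return u

# The same element typically has many representations with small coefficients:
print(evaluate([10, 0, 0, 0]), evaluate([13, 0, 2, 1]))   # both are 10 (mod 97)
```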

In Section 2 we show that the cost of multiplication in this representation is only a constant factor worse than for the usual representation. The key ingredient is Minkowski’s theorem on lattice vectors in symmetric convex sets. We also give algorithms for converting between this representation and the standard representation. The conversions are not as fast as one might hope — in particular, we do not know how to carry them out in quasilinear time — but surprisingly this turns out not to affect the overall complexity, because in the main multiplication algorithm we perform the conversions only infrequently.

Then in Sections 3 and 4 we prove Theorem 1.1, using an algorithm that is structurally very similar to [11]. We make no attempt to minimise the implied big-O constant in Theorem 1.1; our goal is to give the simplest possible proof of the asymptotic bound, without any regard for questions of practicality.

An interesting question is whether it is possible to combine the techniques of the present paper with those of [12] to obtain an algorithm achieving K = 2√2 ≈ 2.83. Our attempts in this direction have so far been unsuccessful. One might also ask if the techniques of this paper can be transferred to the case of multiplication of polynomials of high degree over a finite field. However, this is not so interesting, because an unconditional proof of the bound corresponding to K = 4 in the polynomial case is already known [12].

Throughout the paper we use the following notation. We write lg n for ⌈log₂ n⌉ (n ≥ 2), and for convenience put lg 1 := 1. We define M_SS(n) := C n lg n lg lg n, where C is some constant so that the Schönhage–Strassen algorithm multiplies n-bit integers in at most M_SS(n) bit operations [19]. This function satisfies d·M_SS(n) ≤ M_SS(dn) for any d ≥ 1, and also M_SS(dn) = O(M_SS(n)) for fixed d. An n-bit integer may be divided by an n'-bit integer (n' ≤ n), producing quotient and remainder, in O(M_SS(n)) bit operations [20, Ch. 9]. We may transpose an n₁ × n₂ array of objects of bit size b in O(b n₁ n₂ lg min(n₁, n₂)) bit operations [4, Appendix]. Finally, we occasionally use Xylouris's refinement of Linnik's theorem [21], which states that for any relatively prime positive integers a and d, the least prime in the arithmetic progression p = a (mod d) satisfies p = O(d^5.18).

2. θ-representations

Throughout this section, fix an integer q ≥ 2 and a power of two m ≥ 2 such that

(2.1) m ≤ lg q / (lg lg q)²,

and assume we are given some θ ∈ Z/qZ such that θ^m = −1 (mod q).

For a polynomial U = u_0 + u_1 x + ··· + u_(m−1) x^(m−1) ∈ Z[x], define ||U|| := max_i |u_i|. This norm satisfies ||UV|| ≤ m ||U|| ||V|| for any U, V ∈ Z[x] of degree less than m.

Definition 2.1.

Let u ∈ Z/qZ. A θ-representation for u is a polynomial U ∈ Z[x] of degree less than m such that U(θ) = u (mod q) and ||U|| ≤ 2m q^(1/m).

Example 2.2.

Let q = 97, m = 4 and θ = 64; one checks that θ⁴ = 64⁴ = −1 (mod 97). (For the sake of a readable illustration we take tiny parameters here, ignoring the size condition (2.1).)

The coefficients in a θ-representation must not exceed 2m q^(1/m) = 8 · 97^(1/4) ≈ 25.1. Two examples of θ-representations for u = 10 are

U₁ = 10 and U₂ = x³ + 2x² + 13,

since 64³ + 2·64² + 13 = 50 + 44 + 13 = 10 (mod 97).

By (2.1), the number of bits required to store a θ-representation U is at most m(lg(2m q^(1/m)) + O(1)) = lg q + O(m lg m),

so a θ-representation incurs very little overhead in space compared to the standard representation by an integer in the interval 0 ≤ u < q.

Our main tool for working with θ-representations is the reduction algorithm stated in Lemma 2.9 below. Given a polynomial F ∈ Z[x] of degree less than m, whose coefficients may have up to about twice as many bits as allowed in a θ-representation, the reduction algorithm computes a θ-representation for F(θ) (up to a certain scaling factor, discussed further below). The basic idea of the algorithm is to precompute a nonzero polynomial P ∈ Z[x] such that P(θ) = 0 (mod q), and then to subtract an appropriate multiple of P from F to make the coefficients small.

After developing the reduction algorithm, we are able to give algorithms for basic arithmetic on elements of Z/qZ given in θ-representation (Proposition 2.15), a more general reduction algorithm for inputs of arbitrary size (Proposition 2.17), and algorithms for converting between standard and θ-representation (Propositions 2.18 and 2.20).

We begin with two results that generate certain precomputed data necessary for the main reduction step.

Lemma 2.3.

In q^(1+o(1)) bit operations, we may compute a nonzero polynomial P ∈ Z[x] of degree less than m such that P(θ) = 0 (mod q) and ||P|| ≤ q^(1/m).

Proof.

We first establish the existence of a suitable P. Let θ̄ ∈ {0, 1, …, q − 1} denote a lift of θ to Z, and consider the lattice Λ ⊆ Z^m spanned by the rows of the m × m integer matrix

( q           0  0  ···  0 )
( −θ̄          1  0  ···  0 )
( −θ̄²         0  1  ···  0 )
( ⋮                 ⋱      )
( −θ̄^(m−1)    0  0  ···  1 )

Every vector (a_0, a_1, …, a_(m−1)) ∈ Λ satisfies the equation a_0 + a_1 θ̄ + ··· + a_(m−1) θ̄^(m−1) = 0 (mod q). The volume of the fundamental domain of Λ is det Λ = q. The volume of the closed convex symmetric set S := {x ∈ R^m : |x_i| ≤ q^(1/m)} is (2q^(1/m))^m = 2^m q, so by Minkowski's theorem (see for example [14, Ch. V, Thm. 3]), there exists a nonzero vector (a_0, …, a_(m−1)) in Λ ∩ S. The corresponding polynomial P := a_0 + a_1 x + ··· + a_(m−1) x^(m−1) then has the desired properties.

To actually compute P, we simply perform a brute-force search. By (2.1) there are at most (2⌊q^(1/m)⌋ + 1)^m = q^(1+o(1)) candidates to test. Enumerating them in lexicographical order, we can easily evaluate P(θ) (mod q) in an average of O(lg q) bit operations per candidate. ∎

Example 2.4.

Continuing Example 2.2, the coefficients of P must not exceed q^(1/m) = 97^(1/4) ≈ 3.14. A suitable polynomial is given by P = x³ + 2x² + 3; indeed, P(64) = 50 + 44 + 3 = 97 = 0 (mod 97).
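The search in the proof of Lemma 2.3 is easily expressed in code. The following sketch (ours, using the toy parameters of Example 2.2) enumerates all coefficient vectors within the Minkowski bound:

```python
from itertools import product

q, m, theta = 97, 4, 64
B = int(q ** (1.0 / m))   # Minkowski guarantees a solution with |a_i| <= q^(1/m)

def short_vector(q, m, theta, B):
    """Brute-force search for nonzero P with deg P < m, ||P|| <= B and
    P(theta) = 0 (mod q), as in Lemma 2.3."""
    powers = [pow(theta, i, q) for i in range(m)]
    for coeffs in product(range(-B, B + 1), repeat=m):
        if any(coeffs) and sum(a * t for a, t in zip(coeffs, powers)) % q == 0:
            return coeffs
    return None

P = short_vector(q, m, theta, B)
print(P)     # one valid answer is +/-(3, 0, 2, 1), i.e. the P of Example 2.4
assert P is not None and max(abs(a) for a in P) <= B
```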

Remark 2.5.

The computation of P is closely related to the problem of finding an element of small norm in the ideal of the ring Z[ζ] generated by q and ζ − θ̄, where ζ denotes a primitive 2m-th root of unity.

Remark 2.6.

The poor exponential-time complexity of Lemma 2.3 can probably be improved, by taking advantage of more sophisticated lattice reduction or shortest vector algorithms, but we were not easily able to extract a suitable result from the literature. For example, LLL is not guaranteed to produce a short enough vector [15], and the Micciancio–Voulgaris exact shortest vector algorithm [16] solves the problem for the Euclidean norm rather than the uniform norm. In any case, this has no effect on our main result.

Lemma 2.7.

Assume that P has been precomputed as in Lemma 2.3. Let s be the smallest prime exceeding 4m² q^(1/m) such that s ∤ q and such that P is invertible in (Z/sZ)[x]/(x^m + 1). Then s = O(m² q^(1/m)), and in q^(1/m) (m lg q)^(O(1)) bit operations we may compute s and a polynomial J ∈ Z[x] of degree less than m such that ||J|| ≤ s/2 and JP = 1 (mod s, x^m + 1).

Proof.

Let R be the resultant of P (regarded as a polynomial in Z[x]) and x^m + 1. The primes dividing R are exactly the primes ℓ for which P fails to be invertible in (Z/ℓZ)[x]/(x^m + 1). Therefore our goal is to find a prime s > 4m² q^(1/m) such that s ∤ qR.

Since m is a power of two, x^m + 1 is a cyclotomic polynomial and hence irreducible in Q[x]. Thus P and x^m + 1 have no common factor, and so R ≠ 0. Also, we have R = ∏_ζ P(ζ), where ζ runs over the complex roots of x^m + 1. These roots all lie on the unit circle, so |P(ζ)| ≤ m ||P|| ≤ m q^(1/m), and hence |R| ≤ (m q^(1/m))^m = m^m q.

On the other hand, the prime number theorem (in the form Σ_(x<p≤2x) log p ∼ x, see for example [1, §4.3]) implies that there exists an absolute constant c > 0 such that for any x ≥ 2 we have Σ_(x<p≤2x) log p ≥ cx (sum taken over primes). Taking x := 4m² q^(1/m), by (2.1) again we get

Σ_(x<p≤2x) log p ≥ 4c m² q^(1/m) > log(m^m q²) ≥ log |qR|.

In particular, there must be at least one prime in the interval x < s ≤ 2x that does not divide qR.

To find the smallest such s, we first make a list of all primes up to 8m² q^(1/m) in (m² q^(1/m))^(1+o(1)) bit operations. Then for each prime s between 4m² q^(1/m) and 8m² q^(1/m), we check whether s divides q in (lg q)^(1+o(1)) bit operations, and attempt to invert P in (Z/sZ)[x]/(x^m + 1) in (m lg s)^(1+o(1)) bit operations [20, Ch. 11]. ∎

Example 2.8.

Continuing Example 2.2, we have R = 194 = 2q (note that q | R, as P and x⁴ + 1 share the root θ modulo q), s = 211, and J = −99x³ + 101x² + 38x + 10.
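Mirroring the proof of Lemma 2.7, the resultant R can be evaluated as a product over the complex roots of x^m + 1, after which s is found by direct search. A sketch (ours; it reproduces the values of Example 2.8, and the bound 4m²q^(1/m) is the one used in our statement of Lemma 2.7):

```python
import cmath

q, m, theta = 97, 4, 64
P = [3, 0, 2, 1]                    # P = x^3 + 2x^2 + 3 from Example 2.4

def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

# R = product of P(zeta) over the complex roots zeta of x^m + 1.
Rc = complex(1)
for k in range(m):
    zeta = cmath.exp(1j * cmath.pi * (2 * k + 1) / m)
    Rc *= sum(a * zeta ** i for i, a in enumerate(P))
R = round(Rc.real)                  # exact here, since |R| <= m^m * q is tiny
print(R, R % q)                     # 194 = 2q; q | R since P(theta) = 0 (mod q)

# Smallest prime s > 4m^2 q^(1/m) dividing neither q nor R.
s = int(4 * m * m * q ** (1.0 / m)) + 1
while not (is_prime(s) and q % s != 0 and R % s != 0):
    s += 1
print(s)                            # 211
```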

Now we come to the main step of the reduction algorithm, which is inspired by Montgomery’s method for modular reduction [17].

Lemma 2.9.

Assume that P, s and J have been precomputed as in Lemmas 2.3 and 2.7. Given as input a polynomial F ∈ Z[x] with deg F < m and ||F|| ≤ 4m³ q^(2/m), we may compute a θ-representation for s^(−1) F(θ) in O(M_SS(lg q)) bit operations.

Proof.

We first compute the "quotient" Q := JF mod (s, x^m + 1), normalised so that ||Q|| ≤ s/2. This is done by means of Kronecker substitution [20, Ch. 8], i.e., we pack the polynomials J and F into integers, multiply the integers, unpack the result, and reduce the result modulo s and modulo x^m + 1. The packed integers have at most m(lg ||J|| + lg ||F|| + lg m + O(1)) bits, where the lg m term accounts for coefficient growth in the product. By (2.1) and Lemma 2.7, this simplifies to O(lg q) bits, so the integer multiplication step costs O(M_SS(lg q)) bit operations. This bound also covers the cost of the reductions modulo s and modulo x^m + 1.

Next we compute the product G := QP mod (x^m + 1), again using Kronecker substitution, at a cost of O(M_SS(lg q)) bit operations. Since ||Q|| ≤ s/2 and ||P|| ≤ q^(1/m), we have ||G|| ≤ (m/2) s q^(1/m).

By construction of J we have G = QP = JPF = F (mod s, x^m + 1). In particular, all the coefficients of F − G are divisible by s. The last step is to compute the "remainder" H := (F − G)/s; again, this step costs O(M_SS(lg q)) bit operations. Since ||F|| ≤ 4m³ q^(2/m) and s > 4m² q^(1/m), we have

||H|| ≤ ||F||/s + ||G||/s ≤ m q^(1/m) + (m/2) q^(1/m) ≤ 2m q^(1/m).

Finally, since P(θ) = 0 (mod q), and all arithmetic throughout the algorithm has been performed modulo x^m + 1 (recall that θ^m = −1 in Z/qZ), we see that H(θ) = s^(−1)(F(θ) − Q(θ)P(θ)) = s^(−1) F(θ) (mod q). ∎

Using the above reduction algorithm, we may give preliminary addition and multiplication algorithms for elements of Z/qZ given in θ-representation.

Lemma 2.10.

Assume that P, s and J have been precomputed as in Lemmas 2.3 and 2.7. Given as input θ-representations U and V for u, v ∈ Z/qZ, we may compute θ-representations for s^(−1)(u + v) and s^(−1) uv in O(M_SS(lg q)) bit operations.

Proof.

Let the θ-representations be given by U and V, so that ||U||, ||V|| ≤ 2m q^(1/m). We may compute UV mod (x^m + 1) in Z[x] using Kronecker substitution in O(M_SS(lg q)) bit operations, and U + V in O(lg q) bit operations. Note that ||U + V|| ≤ 4m q^(1/m) ≤ 4m³ q^(2/m), and ||UV mod (x^m + 1)|| ≤ m ||U|| ||V|| ≤ 4m³ q^(2/m), so we may apply Lemma 2.9 to obtain the desired θ-representations. ∎

Example 2.11.

Continuing Example 2.2, we walk through an example of computing a product of elements in θ-representation. Let

u = θ + 5 = 69 and v = 2θ + 3 = 34 (mod 97).

Suppose we are given as input the θ-representations

U = x + 5 and V = 2x + 3.

We first compute the product of U and V modulo x⁴ + 1:

F = UV = 2x² + 13x + 15.

We multiply by J and reduce modulo (s, x⁴ + 1) to obtain the quotient

Q = −96x³ − 81x² + 54x − 31,

and then compute G = QP mod (x⁴ + 1) = −211x³ − 209x² + 435x + 15. Then the remainder

H = (F − G)/s = x³ + x² − 2x

is a θ-representation for s^(−1) uv; indeed, H(θ) = 50 + 22 − 128 = −56 = 41 = s^(−1) uv (mod 97).
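The whole pipeline of Lemmas 2.9 and 2.10 fits in a short script. The following sketch (ours) re-derives the numbers of Example 2.11 and verifies the defining property H(θ) = s^(−1) uv (mod q); schoolbook negacyclic multiplication stands in for the Kronecker-substitution products used in the proofs:

```python
# Montgomery-style reduction in theta-representation (Lemmas 2.9 and 2.10).
q, m, theta, s = 97, 4, 64, 211
P = [3, 0, 2, 1]                 # P(theta) = 0 (mod q), from Example 2.4
J = [10, 38, 101, -99]           # J*P = 1 (mod s, x^m + 1), from Example 2.8

def mulmod_negacyclic(a, b):
    """Product of a and b in Z[x]/(x^m + 1), exact integer coefficients."""
    out = [0] * m
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < m:
                out[i + j] += ai * bj
            else:
                out[i + j - m] -= ai * bj    # x^m = -1
    return out

def balanced(c):
    """Representative of c modulo s in (-s/2, s/2]."""
    c %= s
    return c - s if c > s // 2 else c

def reduce_rep(F):
    """Lemma 2.9: return H with small coefficients and
    H(theta) = s^(-1) * F(theta) (mod q)."""
    Q = [balanced(c) for c in mulmod_negacyclic(F, J)]   # quotient
    G = mulmod_negacyclic(Q, P)                          # G = F (mod s)
    assert all((f - g) % s == 0 for f, g in zip(F, G))
    return [(f - g) // s for f, g in zip(F, G)]          # remainder

def evaluate(U):
    u = 0
    for a in reversed(U):
        u = (u * theta + a) % q
    return u

U, V = [5, 1, 0, 0], [3, 2, 0, 0]           # representations of 69 and 34
H = reduce_rep(mulmod_negacyclic(U, V))     # Lemma 2.10: represents s^(-1)uv
print(H)                                    # [0, -2, 1, 1], i.e. x^3 + x^2 - 2x
assert evaluate(H) == evaluate(U) * evaluate(V) * pow(s, q - 2, q) % q
```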

The following precomputation will assist in eliminating the spurious factor s^(−1) appearing in Lemmas 2.9 and 2.10.

Lemma 2.12.

Assume that P, s and J have been precomputed as in Lemmas 2.3 and 2.7. In q^(1+o(1)) bit operations, we may compute a polynomial W ∈ Z[x] of degree less than m such that ||W|| ≤ 2m q^(1/m) and W(θ) = s² (mod q).

Proof.

We may easily compute the totient function φ(q) in q^(1+o(1)) bit operations, by first factoring q. Since gcd(s, q) = 1, we have s^(φ(q)) = 1 (mod q), so that s² = s^(−(φ(q)−2)) (mod q). The constant polynomial 1 is a θ-representation for 1 = s^0. Repeatedly using the identity s^(−1) · s^(−j) · 1 = s^(−(j+1)), i.e., successively applying Lemma 2.10 to the current representation and the representation of 1, we may compute θ-representations for s^(−1), s^(−2), …, and finally for s^(−(φ(q)−2)) = s². ∎

Remark 2.13.

Assuming the factorisation of q is known (which will always be the case in the application in Section 3), the complexity of Lemma 2.12 may be improved to (lg q)^(2+o(1)) bit operations by using a modified "repeated squaring" algorithm.
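The exponent bookkeeping behind Lemma 2.12 and Remark 2.13 can be illustrated without any polynomial arithmetic. Each multiplication of Lemma 2.10 sends representations of s^j and s^k to one of s^(j+k−1); the sketch below (ours) tracks only the exponents and reaches 2 modulo φ(q) in O(log φ(q)) steps:

```python
# Exponent bookkeeping for the modified repeated squaring of Remark 2.13.
phi = 96                       # Euler phi of the toy modulus q = 97

# Repeated "squaring" starting from s^0 = 1 yields exponents e_i = 1 - 2^i:
levels = [0]
for _ in range(phi.bit_length()):
    levels.append((2 * levels[-1] - 1) % phi)

# Combining entries with exponents e_{i_1}, ..., e_{i_k} yields exponent
# 1 - (2^{i_1} + ... + 2^{i_k}); taking the binary digits of phi - 1 makes
# the sum -1 (mod phi), so the final exponent is 2, i.e. we reach s^2.
acc, first = 0, True
for i in range(phi.bit_length()):
    if (phi - 1) >> i & 1:
        acc = levels[i] if first else (acc + levels[i] - 1) % phi
        first = False
print(acc)        # 2
assert acc == 2
```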

Example 2.14.

Continuing Example 2.2, we may take W = −2, since s² = 211² = 44521 = −2 (mod 97).

Henceforth we write D for the tuple (P, s, J, W) of precomputed data generated by Lemmas 2.3, 2.7, and 2.12. Given q, m and θ as input, the above results show that we may compute D in q^(1+o(1)) bit operations. With these precomputations out of the way, we may state complexity bounds for the main operations on θ-representations.

Proposition 2.15.

Assume that D has been precomputed. Given as input θ-representations U and V for u, v ∈ Z/qZ, we may compute θ-representations for u + v, u − v and uv in O(M_SS(lg q)) bit operations.

Proof.

For the product, we first use Lemma 2.10 to compute a θ-representation for s^(−1) uv, and then we use Lemma 2.10 again to multiply by W, to obtain a θ-representation for s^(−1) · (s^(−1) uv) · s² = uv. The sum and difference are handled similarly. ∎

Remark 2.16.

We suspect that the complexity bound for the multiplication step can be improved, but we do not currently know how to achieve this. This question seems closely related to Remark 2.22 below.

Proposition 2.17.

Assume that D has been precomputed. Given as input a polynomial F ∈ Z[x] of degree less than m, with no restriction on ||F||, we may compute a θ-representation for F(θ) in O((m lg ||F|| / lg q + 1) M_SS(lg q)) bit operations.

Proof.

Let b := ⌈(lg q)/m⌉ and t := ⌈(lg ||F||)/b⌉ + 1, so that the coefficients of F have at most tb bits.

We may therefore decompose the coefficients of F into chunks of b bits, i.e., we may compute polynomials F_0, …, F_(t−1) ∈ Z[x] of degree less than m such that ||F_k|| < 2^b and F = F_0 + 2^b F_1 + ··· + 2^((t−1)b) F_(t−1). (Negative coefficients are handled by splitting absolute values and keeping track of signs; this step implicitly requires an array transposition of cost O(tmb lg m) bit operations.) Note that the constant polynomial 2^b is itself a θ-representation, as 2^b ≤ 2m q^(1/m). Now we use Proposition 2.15 repeatedly to compute a θ-representation for F(θ) via Horner's rule, i.e., first we compute a θ-representation for 2^b F_(t−1)(θ) + F_(t−2)(θ), then for 2^b (2^b F_(t−1)(θ) + F_(t−2)(θ)) + F_(t−3)(θ), and so on. Each of the t − 1 steps costs O(M_SS(lg q)) bit operations, and t = O(m lg ||F|| / lg q + 1). ∎
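The decomposition step is ordinary radix-2^b splitting. In the sketch below (ours; bookkeeping only, with plain arithmetic modulo q standing in for the θ-representation operations of Proposition 2.15), the chunk polynomials F_k are recombined by Horner's rule and checked against direct evaluation:

```python
# Chunk decomposition and Horner recombination behind Proposition 2.17.
q, m, theta = 97, 4, 64
b = 3                                # toy chunk size; roughly (lg q)/m in the text

F = [1234, 567, 89, 1011]            # coefficients far above the 2m*q^(1/m) bound

nbits = max(c.bit_length() for c in F)
t = -(-nbits // b)                   # number of chunks (ceiling division)
chunks = [[(c >> (k * b)) & ((1 << b) - 1) for c in F] for k in range(t)]
# Now F = sum over k of 2^(kb) * F_k, with ||F_k|| < 2^b.

acc = [0] * m                        # Horner's rule over the chunks
for Fk in reversed(chunks):
    acc = [((a << b) + c) % q for a, c in zip(acc, Fk)]

def evaluate(U):
    u = 0
    for a in reversed(U):
        u = (u * theta + a) % q
    return u

assert evaluate(acc) == evaluate(F)
```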

Proposition 2.18.

Assume that D has been precomputed. Given as input an element u ∈ Z/qZ in standard representation, we may compute a θ-representation for u in O(m M_SS(lg q)) bit operations.

Proof.

Simply apply Proposition 2.17 to the constant polynomial F := u, noting that lg ||F|| ≤ lg q. ∎

Remark 2.19.

A corollary of Proposition 2.18 is that every u ∈ Z/qZ admits a θ-representation. It would be interesting to have a direct proof of this fact that does not rely on the reduction algorithm. A related question is whether it is possible to tighten the bound in the definition of θ-representation from 2m q^(1/m) to, say, 2q^(1/m) or even ⌈q^(1/m)⌉. We do not know whether such a representation exists for every u ∈ Z/qZ.

Proposition 2.20.

Given as input an element u ∈ Z/qZ in θ-representation, we may compute the standard representation for u in O(m M_SS(lg q)) bit operations.

Proof.

Let U ∈ Z[x] be the input polynomial. The problem amounts to evaluating U(θ) in Z/qZ. Again we may simply use Horner's rule: we compute u_(m−1), then u_(m−1) θ + u_(m−2), and so on, using m − 1 multiplications and additions in Z/qZ, each costing O(M_SS(lg q)) bit operations. ∎

Remark 2.21.

In both Proposition 2.18 and Proposition 2.20, the input and output have bit size O(lg q), but the complexity bounds given are not quasilinear in lg q. It is possible to improve on the stated bounds, but we do not know a quasilinear-time algorithm for the conversion in either direction.

Remark 2.22.

In the reduction algorithm, the reader may wonder why we go to the trouble of introducing the auxiliary prime s. Why not simply precompute an approximation to a real inverse for P, i.e., the inverse of P in R[x]/(x^m + 1), and use this to clear out the high-order bits of each coefficient of the dividend? In other words, why not replace the Montgomery-style division with the more natural Barrett-style division [2]?

The reason is that we cannot prove tight enough bounds on the size of the coefficients of this inverse: it is conceivable that P might accidentally take on a very small value near one of the complex roots of x^m + 1, or equivalently, that the resultant R in the proof of Lemma 2.7 might be unusually small. For the same reason, we cannot use a more traditional 2-adic Montgomery inverse to clear out the low-order bits of the dividend, because again P may take a 2-adically small value near one of the 2-adic roots of x^m + 1, or equivalently, the resultant R might be divisible by an unusually large power of 2.

3. Integer multiplication: the recursive step

In this section we present a recursive routine Transform with the following interface. It takes as input a (sufficiently large) power-of-two transform length L, a prime p, a prime power q := p^λ such that

(3.1) p = 1 (mod L) and lg L · lg lg L ≤ lg q < 2 lg L · lg lg L,

a principal L-th root of unity ω ∈ Z/qZ (i.e., an L-th root of unity whose reduction modulo p is a primitive L-th root of unity in the field Z/pZ), certain precomputed data depending on L, p, q and ω (see below), and a polynomial F ∈ (Z/qZ)[y] of degree less than L. Its output is the DFT of F with respect to ω, that is, the vector

(F(1), F(ω), F(ω²), …, F(ω^(L−1))) ∈ (Z/qZ)^L.

The coefficients of both the input and the output are given in standard representation.

The precomputed data consists of the tuple D defined in Section 2, where m and θ are defined as follows.

First, (3.1) implies that lg q / (lg lg q)² ≥ 2 for large L, so we may take m to be the unique power of two lying in the interval

(3.2) lg q / 2(lg lg q)² < m ≤ lg q / (lg lg q)².

Observe that (2.1) is certainly satisfied for this choice of m (for large enough L), as (3.1) implies that lg q ≥ lg L · lg lg L.

Next, note that 2m divides L: both are powers of two, and (3.1) and (3.2) imply that 2m ≤ lg q ≤ L for large L. We may therefore take θ := ω^(L/2m), so that θ^m = ω^(L/2) = −1 in Z/qZ (the square of ω^(L/2) is 1, and its reduction modulo p is −1).

We remark that the role of the parameter λ is to give us enough control over the bit size of q, to compensate for the fact that Linnik's theorem does not give us sufficiently fine control over the bit size of p (see Lemma 3.5).
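As a sanity check on the definitions of ω and θ, the following snippet (ours; tiny parameters with λ = 1, whereas the algorithm takes L and q astronomically large) recovers the θ of Example 2.2:

```python
# Choosing theta = omega^(L/(2m)) from a principal L-th root of unity omega.
p, lam = 97, 1
q = p ** lam
L, m = 32, 4                       # powers of two with 2m | L and p = 1 (mod L)
assert (p - 1) % L == 0

g = 5                              # a primitive root modulo 97
omega = pow(g, (p - 1) // L, q)    # L-th root of unity, primitive modulo p
theta = pow(omega, L // (2 * m), q)

assert pow(omega, L, q) == 1 and pow(omega, L // 2, q) == q - 1
assert pow(theta, m, q) == q - 1   # theta^m = -1, as required in Section 2
print(omega, theta)                # 28 64
```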

Our implementation of Transform uses one of two algorithms, depending on the size of L. If L is below some suitable threshold, then it uses any convenient base-case algorithm. Above this threshold, it reduces the given DFT problem to a collection of exponentially smaller DFTs of the same type, via a series of reductions that may be summarised as follows.

  1. Use the conversion algorithms from Section 2 to reduce to a transform over Z/qZ in which the input and output coefficients are given in θ-representation. (During steps (ii) and (iii) below, all elements of Z/qZ are stored and manipulated entirely in θ-representation.)

  2. Reduce the "long" transform of length L over Z/qZ to many "short" transforms of exponentially small length S := 2m over Z/qZ, via the Cooley–Tukey decomposition.

  3. Reduce each short transform from step (ii) to a product in (Z/qZ)[z]/(z^S − 1), i.e., a cyclic convolution of length S, using Bluestein's algorithm.

  4. Use the definition of θ-representation to reinterpret each product from (iii) as a product in Z[x, z]/(x^m + 1, z^S − 1), where the coefficients in Z are exponentially smaller than the original coefficients in Z/qZ.

  5. Embed each product from (iv) into (Z/q'Z)[x, z]/(x^m + 1, z^S − 1) for a suitable prime power q' that is exponentially smaller than q, and large enough to resolve the coefficients of the products over Z.

  6. Reduce each product from (v) to a collection of forward and inverse DFTs over Z/q'Z, and recurse.

The structure of this algorithm is very similar to that of [11]. The main difference is that it is not necessary to explicitly split the coefficients into chunks in step (iv); this happens automatically as a consequence of storing the coefficients in θ-representation. In effect, the splitting (and reassembling) work has been shunted into the conversions in step (i).

We now consider each of the above steps in more detail. We write T(L) for the running time of Transform. We always assume that the base-case threshold is increased whenever necessary to accommodate statements that hold only for large L.

Step (i) — convert between representations. Let T_θ(L) denote the time required to compute a DFT of length L over Z/qZ with respect to ω, assuming that the coefficients of the input and the output are given in θ-representation, and assuming that D is known.

Lemma 3.1.

We have T(L) ≤ T_θ(L) + O(L m M_SS(lg q)).

Proof.

We first convert the L input coefficients from standard to θ-representation using Proposition 2.18; we then compute the DFT, working entirely in θ-representation; at the end, we convert the L output coefficients back to standard representation using Proposition 2.20. By (3.1) and (3.2), the total cost of the conversions is O(L m M_SS(lg q)) = L (lg L)^(2+o(1)) bit operations. ∎

Henceforth all elements of Z/qZ are assumed to be stored in θ-representation, and we will always use Proposition 2.15 to perform arithmetic operations on such elements in O(M_SS(lg q)) bit operations.

Step (ii) — reduce to short DFTs. Let S := 2m. Given as input N polynomials G_1, …, G_N ∈ (Z/qZ)[z] of degree less than S (presented sequentially on tape), let T_short(N) denote the time required to compute the N transforms of the G_i with respect to the principal S-th root of unity ω^(L/S). (Here and below, we continue to assume that D is known.)

Lemma 3.2.

We have T_θ(L) ≤ d T_short(L/S) + O((d + e) L M_SS(lg q) + L lg L lg q), where d and e are the unique non-negative integers such that L = S^d 2^e and 0 ≤ e < lg S.

Proof.

Let d and e be as above, so that L = S^d 2^e with 0 ≤ e < lg S. Applying the Cooley–Tukey method [5] to this factorisation, the given transform of length L may be decomposed into d layers, each consisting of L/S transforms of length S (with respect to ω^(L/S)), followed by e layers, each consisting of L/2 transforms of length 2. Between each of these layers, we must perform L multiplications by "twiddle factors" in Z/qZ, which are given by certain powers of ω. (For further details of the Cooley–Tukey decomposition, see for example [13, §2.3].)

The total cost of the twiddle factor multiplications, including the cost of computing the twiddle factors themselves, is O((d + e) L M_SS(lg q)). This bound also covers the cost of the length-2 transforms ("butterflies"), each of which requires one addition and one subtraction in Z/qZ.

In the Turing model, we must also account for the cost of rearranging data so that the inputs for each layer of short DFTs are stored sequentially on tape. The cost per layer is O(L lg q lg S) bit operations, so altogether O(L lg q lg L) bit operations (see [13, §2.3] for further details). ∎
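The two-factor case of the Cooley–Tukey decomposition used above can be checked directly. In the sketch below (ours), the inner and outer transforms are naive DFTs over the toy field Z/97Z; in the algorithm proper the coefficients would be held in θ-representation:

```python
# Cooley-Tukey: a DFT of length L = L1*L2 via short DFTs plus twiddle factors.
q = 97
L1, L2 = 2, 4
L = L1 * L2
omega = pow(5, (q - 1) // L, q)      # element of order L = 8 in (Z/97Z)*

def dft(a, w):
    """Naive DFT: A[k] = sum_j w^(jk) * a[j] (mod q)."""
    n = len(a)
    return [sum(a[j] * pow(w, j * k % n, q) for j in range(n)) % q
            for k in range(n)]

def ct_dft(a):
    # length-L2 DFTs of the strided subsequences, with root omega^L1
    inner = [dft(a[j1::L1], pow(omega, L1, q)) for j1 in range(L1)]
    # twiddle factor multiplications
    tw = [[inner[j1][k2] * pow(omega, j1 * k2, q) % q for k2 in range(L2)]
          for j1 in range(L1)]
    # length-L1 DFTs across the rows, with root omega^L2
    out = [0] * L
    for k2 in range(L2):
        col = dft([tw[j1][k2] for j1 in range(L1)], pow(omega, L2, q))
        for k1 in range(L1):
            out[k2 + L2 * k1] = col[k1]
    return out

a = [3, 1, 4, 1, 5, 9, 2, 6]
assert ct_dft(a) == dft(a, omega)
```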

Step (iii) — reduce to short convolutions. Given polynomials A_1, B_1, …, A_N, B_N ∈ (Z/qZ)[z]/(z^S − 1) as input, let C_cyc(N) denote the time required to compute the products A_1 B_1, …, A_N B_N.

Lemma 3.3.

We have T_short(N) ≤ C_cyc(N) + O(N S M_SS(lg q)).

Proof.

We use Bluestein's method [3], which reduces the problem of computing the DFT of G_i to the problem of computing the product of certain polynomials A_i, B_i ∈ (Z/qZ)[z]/(z^S − 1), plus O(S) auxiliary multiplications in Z/qZ (for further details see [13, §2.5]). Here A_i depends on G_i and ω, but B_i depends only on ω. The total cost of the auxiliary multiplications is O(N S M_SS(lg q)). ∎
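Bluestein's identity jk = (j² + k² − (j−k)²)/2 converts a DFT into a cyclic convolution. A toy verification (ours; over Z/97Z, using a square root η of ω, with S even so that the sequence η^(−j²) is periodic of period S):

```python
# Bluestein: a length-S DFT as a cyclic convolution of length S.
q, S = 97, 8
eta = pow(5, (q - 1) // (2 * S), q)    # element of order 2S; omega = eta^2
omega = eta * eta % q
inv_eta = pow(eta, q - 2, q)

a = [3, 1, 4, 1, 5, 9, 2, 6]
f = [a[j] * pow(eta, j * j % (2 * S), q) % q for j in range(S)]
g = [pow(inv_eta, j * j % (2 * S), q) for j in range(S)]

# A[k] = eta^(k^2) * (f cyclically convolved with g)[k]
conv = [sum(f[j] * g[(k - j) % S] for j in range(S)) % q for k in range(S)]
bluestein = [pow(eta, k * k % (2 * S), q) * conv[k] % q for k in range(S)]

naive = [sum(a[j] * pow(omega, j * k % S, q) for j in range(S)) % q
         for k in range(S)]
assert bluestein == naive
```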

Step (iv) — reduce to bivariate products over Z. Given as input polynomials F_1, G_1, …, F_N, G_N ∈ Z[x, z]/(x^m + 1, z^S − 1), all of whose coefficients are bounded in absolute value by 2m q^(1/m), let C_int(N) denote the cost of computing the products F_1 G_1, …, F_N G_N.

Lemma 3.4.

We have C_cyc(N) ≤ C_int(N) + O(N S M_SS(lg q)).

Proof.

We are given as input polynomials A_i, B_i ∈ (Z/qZ)[z]/(z^S − 1), with coefficients in θ-representation. We may immediately reinterpret them as polynomials F_i, G_i ∈ Z[x, z]/(x^m + 1, z^S − 1), with coefficients bounded in absolute value by 2m q^(1/m). By the definition of θ-representation, we have F_i(θ, z) = A_i(z) (mod q), and similarly for the G_i.

After computing the products H_i := F_i G_i for i = 1, …, N, suppose that

H_i = Σ_(0≤j<S) Σ_(0≤k<m) h_ijk x^k z^j.

Then we have (A_i B_i)(z) = Σ_j (Σ_k h_ijk θ^k) z^j (mod q) for each i. Therefore, to compute the desired products with coefficients in θ-representation, it suffices to apply Proposition 2.17 to each of the polynomials Σ_k h_ijk x^k, to compute θ-representations for all of the coefficients of the A_i B_i.

Let us estimate the cost of these NS invocations of Proposition 2.17. We have |h_ijk| ≤ mS (2m q^(1/m))² = 4m³ S q^(2/m), so

lg ||H_i|| ≤ 2(lg q)/m + O(lg m + lg S).

From (3.2) we have (lg q)/m < 2(lg lg q)², so for large L,

(3.3) lg ||H_i|| = O((lg lg q)²) = O((lg q)/m).

Each invocation of Proposition 2.17 therefore costs only O(M_SS(lg q)) bit operations, and the cost of applying Proposition 2.17 for all i and j is O(N S M_SS(lg q)). ∎

Step (v) — reduce to bivariate products over Z/q'Z. Let p' be the smallest prime such that p' = 1 (mod 2mS); by Linnik's theorem (in Xylouris's form) we have p' = O((mS)^5.18). Put q' := p'^(λ'), where λ' is minimal such that

q' ≥ 8m³ S q^(2/m),

so that q' is large enough to resolve the (signed) coefficients h_ijk of the products over Z from step (iv).

We have the following bounds for λ' and q'.

Lemma 3.5.

Let λ' and q' be as above, with h_ijk as in the proof of Lemma 3.4, for 1 ≤ i ≤ N, 0 ≤ j < S and 0 ≤ k < m. Then lg q' = O((lg lg q)²) and λ' = O(lg lg q).

Proof.

In what follows, we frequently use the fact that