## 1. Introduction

Let and let and be integers in the interval .
We write for the cost of computing the *full product* of and , which is just the usual -bit product

. Unless otherwise specified, by ‘cost’ we mean the number of bit operations, under a model such as the multitape Turing machine

[12].In this paper we are interested in two types of *truncated product*.
The *low product* of and is the unique integer in the interval such that , or in other words, the low bits of .
We denote the cost of computing the low product by .

The *high product* of and is defined to be any integer in the range such that .
Thus there are at most two possible values for the high product, and an algorithm that computes it is permitted to return either one.
The high product consists of, more or less, the high bits of , except that we allow a small error in the lowest bit.
We denote the cost of computing the high product by .

There are many applications of truncated products in computer arithmetic. The most obvious example is high-precision arithmetic on real numbers: to compute an -bit approximation to the product of two real numbers with -bit mantissae, one may scale by an appropriate power of two to convert the inputs into -bit integers, and then compute the high product of those integers. Further examples include Barrett’s and Montgomery’s algorithms for modular arithmetic [1, 10].

It is natural to ask whether a truncated product can be computed more quickly than a full product. This is indeed the case for small : in the classical quadratic-time regime, one can compute a truncated product in about half the time of a full product, because essentially only half of the bit-by-bit products contribute to the desired output.

However, as grows, and more sophisticated multiplication algorithms are deployed, these savings begin to dissipate. Consider for instance Karatsuba’s algorithm, which has complexity for . Mulders showed [11] that one can adapt Karatsuba’s algorithm to obtain bounds for and around . However, it is not known how to reach in this regime.

For much larger values of

, the most efficient integer multiplication algorithms known are based on FFTs (fast Fourier transforms). The complexity of these algorithms is of the form

; see [9] for the smallest term presently known.It has long been thought that the best way to compute a truncated product using FFT-based algorithms is simply to compute the full product and then discard the unwanted part of the output. One might be able to save bit operations compared to the full product, by skipping computations that do not contribute to the desired half of the output, but no bounds of the type or have been proved for any constant .

For some closely related problems, one can actually *prove* that it is not possible to do better than computing the full product.
For example, in a suitable algebraic model, the multiplicative complexity of any algorithm that computes the low coefficients of the product of two polynomials of degree less than is at least [4, Thm. 17.14], which is the same as the multiplicative complexity of the full product.
By analogy, one might expect the same sort of lower bound to apply to truncated integer multiplication.

In this paper we show that this belief is mistaken: we will present algorithms that compute high and low products of integers in asymptotically of the time required for a full product. The new algorithms require that the underlying integer multiplication is carried out via a cyclic convolution of sequences of real numbers. This includes any real convolution algorithm based on FFTs.

Unfortunately, because the new methods rely heavily on the archimedean property of , we do not yet know how to obtain this 25% reduction in complexity for arbitrary integer multiplication algorithms. In particular, we are currently unable to establish analogous results for integer multiplication algorithms based on FFTs over other rings, such as finite fields.

The remainder of the paper is structured as follows. In Section 2 we state our main results precisely, after making some preliminary definitions. Section 3 presents the new algorithm for the low product, including the proof of correctness and complexity analysis. Section 4 does the same for the high product. Section 5 gives some performance data for an implementation of the new algorithms. Finally, Section 6 sketches several further results that can be proved using the methods introduced in this paper.

## 2. Setup and statement of results

### 2.1. Fixed point arithmetic and real convolutions

We write for . To simplify analysis of numerical error, all algorithms are assumed to work with the following fixed-point representation for real numbers (see [9, §3] for a more detailed treatment). Let be a precision parameter. We write for the set of real numbers of the form where is an integer in the interval . Thus models the unit interval , and elements of are represented using bits of storage. For , we write for the set of real numbers of the form where . An element of is represented simply by its mantissa in ; the exponent is always known from context, and is not explicitly stored.

We will frequently work with quotient rings of the form where is some fixed monic polynomial of positive degree, such as . If and , we write for the coefficients of with respect to the standard monomial basis; that is, . For such we define a norm

The expression indicates the set of polynomials whose coefficients lie in ; this is a slight abuse of notation, as

is not really a ring. Algorithms always represent such a polynomial by its coefficient vector

.We assume that we have available a subroutine Convolution with the following properties. It takes as input two parameters and , and polynomials

Let ; more explicitly,

Then Convolution is required to output a polynomial

such that

In other words, Convolution computes a -bit approximation to the cyclic convolution of two real input sequences of length .

We write for the bit complexity of Convolution. We treat this routine as a black box; its precise implementation is not important for our purposes. A typical implementation would execute a real-to-complex FFT for each input sequence, multiply the Fourier coefficients pointwise, and then compute an inverse complex-to-real transform to recover the result. Internally, it should work to precision slightly higher than to manage rounding errors during intermediate computations (for an explicit error bound, see for example [3, Theorem 3.6]). The routine may completely ignore the exponent parameter .

### 2.2. The full product

For completeness, we recall the well-known algorithm that uses Convolution to compute the full product of two -bit integers (Algorithm 1). It depends on two parameters: a chunk size , and a transform length , where . The idea is to cut the integers into chunks of bits, thereby converting the integer multiplication problem into the problem of multiplying two polynomials in modulo .

We will not discuss in this paper the question of optimising the choice of and . The optimal choice of will involve some balance between making as close to as possible, but also ensuring that is sufficiently smooth (has only small prime factors) so that FFTs of length are as efficient as possible. (An alternative approach is to use “truncated FFTs” [13], which eliminates the need to choose a smooth transform length. However, this makes no difference asymptotically. Despite the overlapping terminology, it is not clear whether the new truncated multiplication algorithms can be adapted to the case of truncated FFTs.)

Integers

*.*

###### Theorem 2.1 (Full product).

Let , and let and be -bit integers. Let and be integers such that . Then Algorithm 1 correctly computes the full product of and . Assuming that , its complexity is

###### Proof.

The condition ensures that the decompositions of and into and in line 2 are legal. Let

and

Note that and are the images of and in , and by construction and . Since has degree at most , it is determined by its remainder modulo . Line 3 computes an approximation to this remainder with

for each . The function in line 4 rounds its argument to the nearest integer (ties broken in either direction as convenient). Since , we deduce that for each ; hence line 4 returns .

The main term in the complexity bound arises from the Convolution call in line 3. The secondary term consists of the splitting step in line 2, which costs bit operations per coefficient, and the overlap-add procedure in line 4, which requires bit operations per coefficient. ∎

### 2.3. Statement of results

The main results of this paper are the following analogues of Theorem 2.1 for the low product and high product. These are proved in Section 3 and Section 4 respectively.

###### Theorem 2.2 (Low product).

Let , and let and be -bit integers. Let and be integers such that . Then Algorithm 2 correctly computes the low product of and . Assuming that , its complexity is

(2.1) |

###### Theorem 2.3 (High product).

Let , and let and be -bit integers. Let and be integers such that . Then Algorithm 3 correctly computes the high product of and . Assuming that , its complexity is

(2.2) |

Comparing these complexity bounds to Theorem 2.1, we observe that the convolution length has dropped from to , but the working precision has increased from roughly to roughly . To understand the implications for the overall complexity, we need to make further assumptions on the growth of as a function of . We consider two scenarios.

Scenario #1: asymptotic behaviour as .
We assume that and are chosen so that as .
We also assume that is restricted to suitably smooth values (for instance, the *ultrasmooth numbers* defined by Bernstein [2]), and that is chosen to be exponentially smaller than but somewhat larger than , say and (as is done for example in [9, §6]).
This choice of ensures that

as , and similarly for .

Under these assumptions, it is reasonable to expect that the complexity of the underlying real convolution is quasi-linear with respect to the total bit size of the input, i.e., behaves roughly like . This is the case for all FFT-based convolution algorithms known to the author. We conclude that

as , and similarly for the high product. This justifies our assertion that asymptotically the new truncated product algorithms save 25% of the total work compared to the full product.

Scenario #2: fixed word size. Now let us consider the situation faced by a programmer working on a modern microprocessor with hardware support for a fixed word size, such as the 53-bit double-precision floating point type provided by the IEEE 754 standard. In this setting, the Convolution subroutine takes as input two vectors of coefficients represented by this data type, and computes their cyclic convolution using some sort of FFT, taking direct advantage of the hardware arithmetic. We assume that is chosen as large as possible so that the FFTs can be performed in this way; for example, under IEEE 754 we would require that for the low product, where is an allowance for numerical error. Obviously in this scenario it does not make sense to allow , and it also does not quite make sense to measure complexity by the number of “bit operations”. Instead, should be restricted to lie in some finite (possibly quite large) range, and a more natural measure of complexity is the number of word operations (ignoring issues such as locality and parallelism).

We claim that it is still reasonable to expect a reduction in complexity close to 25%. To see this, consider a full product computation for a given , with splitting parameters and . Let and be the splitting parameters for the corresponding truncated product (for the same ). We should choose around to ensure that we still take maximum advantage of the available floating-point type. Then we should choose around to compensate for the smaller chunks. Now observe that (for large ) the bulk of the work for the full product consists of FFTs of length , but for the truncated products the FFT length is reduced to around . Since the FFTs run in quasilinear time (word operations), we expect to see roughly 25% savings. In practice this will be tempered somewhat by the additional linear-time work inherent in the truncated product algorithms, such as the the evaluation of and in Algorithm 2. The situation is also complicated by the fact that we are constrained to choose smooth transform lengths. Section 5 gives some empirical evidence in favour of this conclusion.

## 3. The low product

### 3.1. The cancellation trick

The key to the new low product algorithm is the following simple observation.

###### Proposition 3.1.

Let with . Let be the remainder on dividing by

with . Then , , and

###### Proof.

Write

Since , we obtain

This shows that . Moreover,

Later we will apply Proposition 3.1 to a polynomial analogous to the encountered earlier in the proof of Theorem 2.1. The proposition shows that after reducing modulo and making the substitution , the term in causes the unwanted high-order coefficients of to disappear. An alternative point of view is that polynomial multiplication modulo corresponds roughly to integer multiplication modulo

To make use of Proposition 3.1 to compute a low product, we must compute *exactly*.
Note that the coefficients of lie in rather than .
Consequently, to compute , we must increase the working precision by bits compared to the precision used in the full product algorithm.
This is why the precision parameter in Theorem 2.2 (and Theorem 2.3) is instead of .

### 3.2. The roots of

In this section we study the roots of the special polynomial introduced in Proposition 3.1. For , let denote the open disc .

###### Lemma 3.2.

The roots of lie in , and they are all simple.

###### Proof.

Any multiple root of would have to satisfy

and hence . This implies that , contradicting . ∎

Now consider the function

where means the branch that maps to .

###### Lemma 3.3.

The function maps roots of to roots of .

###### Proof.

If is a root of , then

In fact, always sends a root of to the root of nearest to it, but we will not prove this. Figure 1 illustrates the situation for and , showing that the roots of are very close to those of . (For the roots are already too close together to distinguish at this scale.)

For any , the binomial theorem implies that is represented on by the series

where

In particular, the first few terms of are

We will need to construct an explicit functional inverse for , in order to map the roots of back to roots of . Let be the formal power series inverse of , i.e., so that

The coefficients of , and of its powers, are given as follows.

###### Lemma 3.4.

For any we have (formally)

where and

In particular, the first few terms of are

###### Proof.

By the Lagrange inversion formula (for example, in equation (2.1.2) of [6], set and ), we find that is equal to the coefficient of in

Therefore for we have

###### Lemma 3.5.

For all and we have

###### Proof.

The bounds are trivial for , so assume that . Then

and

###### Corollary 3.6.

The series for and converge on and respectively, and

(3.2) |

###### Proof.

We already know that converges on , and the convergence of on follows from Lemma 3.5. If , then

so maps into . A similar argument shows that maps into . Since both and map into the disc of convergence of the other, and since they are inverses formally, they must be inverse functions in the sense of (3.2). ∎

###### Corollary 3.7.

The functions and induce mutually inverse bijections between the roots of and the roots of .

### 3.3. Ring isomorphisms

The aim of this section is construct a pair of mutually inverse ring isomorphisms

In the main low product algorithm, the role of these maps will be to convert the problem of multiplying two polynomials modulo into an ordinary cyclic convolution.

The idea of the construction is that for , we want to define to be the composition , considered as a polynomial modulo . The map is defined similarly. However, since and are not polynomials, there are convergence issues to consider. We now proceed to give a formal construction of and in terms of the power series expansions of and , to make all of this more precise.

For each define linear maps

by the formulas

These maps satisfy the following norm bounds.

###### Lemma 3.8.

For any and ,

###### Proof.

For any we have , because multiplication by simply permutes the coefficients cyclically. Applying this observation to , by Lemma 3.5 we find that

###### Lemma 3.9.

For any and ,

###### Proof.

We now define and by setting

Lemma 3.8 and Lemma 3.9 guarantee that these series converge coefficientwise, so and

are well-defined, and they are clearly linear maps. Moreover, we immediately obtain the following estimates.

###### Lemma 3.10.

For any and any we have

###### Proof.

###### Lemma 3.11.

For any and any we have

###### Proof.

Similar to the proof of Lemma 3.10. ∎

Now we can establish that and behave like the desired compositions and .

###### Lemma 3.12.

Let , and let be a root of . Then

###### Proof.

By the definition of ,

so

###### Lemma 3.13.

Let , and let be a root of . Then

###### Proof.

Similar to the proof of Lemma 3.12. ∎

###### Corollary 3.14.

The maps and are mutually inverse ring isomorphisms between and .

###### Proof.

Lemma 3.12 implies that

for any and any root of . Since a polynomial in is determined by its values at the roots of , this shows that , and hence that is a ring homomorphism. A similar argument shows that is a ring homomorphism. Finally, if and is a root of , then Corollary 3.7 implies that

Therefore and are inverses. ∎

Finally, we have the following two results concerning the complexity of approximating and .

###### Proposition 3.15 (Approximating ).

Given as input , we may compute such that

in bit operations, assuming that .

Comments

There are no comments yet.