Derivation and Analysis of Fast Bilinear Algorithms for Convolution

by   Caleb Ju, et al.

The prevalence of convolution in applications within signal processing, deep neural networks, and numerical solvers has motivated the development of numerous fast convolution algorithms. In many of these problems, convolution is performed on terabytes or petabytes of data, so even constant factors of improvement can significantly reduce the computation time. We leverage the formalism of bilinear algorithms to describe and analyze all of the most popular approaches. This unified lens permits us to study the relationship between different variants of convolution as well as to derive error bounds and analyze the cost of the various algorithms. We provide new derivations, which predominantly leverage matrix and tensor algebra, to describe the Winograd family of convolution algorithms as well as reductions between 1D and multidimensional convolution. We provide cost and error bounds as well as experimental numerical studies. Our experiments for two of these algorithms, the overlap-add approach and Winograd convolution algorithm with polynomials of degree greater than one, show that fast convolution algorithms can rival the accuracy of the fast Fourier transform (FFT) without using complex arithmetic. These algorithms can be used for convolution problems with multidimensional inputs or for filters larger than size of four, extending the state-of-the-art in Winograd-based convolution algorithms.



There are no comments yet.


page 1

page 2

page 3

page 4


An exact, cache-localized algorithm for the sub-quadratic convolution of hypercubes

Fast multidimensional convolution can be performed naively in quadratic ...

Generalized gaussian bounds for discrete convolution powers

We prove a uniform generalized gaussian bound for the powers of a discre...

Very Efficient Training of Convolutional Neural Networks using Fast Fourier Transform and Overlap-and-Add

Convolutional neural networks (CNNs) are currently state-of-the-art for ...

The numerical approximation of the Schrödinger equation with concentrated potential

We present a family of algorithms for the numerical approximation of the...

A Bilinear Programming Approach for Multiagent Planning

Multiagent planning and coordination problems are common and known to be...

Two-level Group Convolution

Group convolution has been widely used in order to reduce the computatio...

Efficient Winograd Convolution via Integer Arithmetic

Convolution is the core operation for many deep neural networks. The Win...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Discrete convolution is a bilinear function that combines two sequences of data to produce a third. Problems such as multiplication [35, 77, 2], signal processing [14, 13, 54, 38], statistics [66, 57], acoustics [1, 73], image processing [55, 72]

, and numerical solvers for partial differential equations within physics and chemistry 

[90, 92, 46] use convolution. Consequently, fast methods for convolution can reduce the computation time for various problems. Given two inputs of size , a direct computation of convolution performs at most additions and multiplications.

Over the years, fast algorithms have been studied and used to compute convolution. Most fast algorithms operate in three steps: compute the linear combinations of both inputs, calculate the element-wise product of those linear combinations, and then recover the result by computing the linear combinations of the products. The first known fast algorithm is Karatsuba’s algorithm [44], which achieves a complexity of . The most prominent fast algorithm employs the discrete Fourier transform (DFT) to obtain suitable linear combinations. This fast algorithm leverages the fast Fourier transform (FFT) to obtain the linear combinations in time [35, 34]. The FFT-approach reduces the number of bilinear products necessary to , yielding an algorithm with an overall cost of instead of the cost incurred by the direct method.

For a convolution with two

-dimensional vectors, the cost and stability of the FFT make it the method of choice. However, in many scenarios, including signal processing and convolutional neural networks (CNN), a small

filter of size is convolved with a large vector of size . A naive application of the FFT requires cost, which is worse than the cost of the direct method when . The use of FFTs of size yields a lower cost of . Furthemore, when is small, the constant factors incurred by FFT, due in part to the use of complex arithmetic, can make it uncompetitive [25]. Given a direct implementation of complex arithmetic, an FFT-based convolution with -dimensional vectors requires real additions and real multiplications. For sufficiently small dimensions, the direct approach requires less work than the FFT. While the direct approach is efficient in such cases, other fast algorithms can obtain yet lower constant factors, yielding practical benefits. Consider the use of convolution in CNNs. The convolutional layer of the CNN architecture, AlexNet [53], takes approximately three hours, about nintey percent of the CNN’s overall runtime, to convolve images when running on a single-threaded CPU [17]. Even a constant factor improvement over the direct method can save minutes to hours for this type of problem. Fast algorithms present a variety of methods with lower cost complexities.

Beyond adaptation for small filters, another remaining challenge is the development of efficient methods for multidimensional (especially, 2D and 3D) convolution algorithms. Efficient algorithms for 2D and 3D convolution are important for applications within scientific computing and CNNs. The FFT-based approach is well-suited for the former domain, but the use of small filters in CNNs again leaves room for further innovation. To the best of our knowledge, the main algorithms for computing convolution in CNNs are either matrix-multipliction [16], the FFT [27], or a few variants of Winograd’s convolution algorithm [91, 55, 6]. We propose other variants of the general Winograd formulation that are suitable for higher dimensions, such as 2D, 3D, and 4D convolutions.

1.1 Previous surveys and key related work

Convolution has been studied and surveyed before in signal processing [67, 39, 62, 8, 12, 79]. Some of these methods have been presented as bilinear algorithms, which provide a framework to define new algorithms for larger convolutions via a matrix nesting by the Kronecker product and embedding using the Winograd convolution algorithm. In addition to using the formalism of bilinear algorithms to define these methods, we provide explicit formulations on how to generate the matrices for the various convolution algorithms. We provide new, simple derivations of many of the key methods, and specially address multidimensional and small-filter convolution scenarios.

An important consideration for bilinear algorithms is the number of additions and element-wise multiplications required to compute the algorithm. As the problem size grows, the overhead of applying linear combinations increases quadratically. Variations of bilinear algorithms for convolution can trade-off the number of linear combinations and element-wise multiplications needed [8], which has subsequently been studied and optimized for various implementations of convolution algorithms [10]. We provide similar tables as well as supplementary material111 for readers to generate the matrices themselves.

With the advent of parallel computing, the scalability of convolution algorithms is crucial for building highly efficient algorithms. The parallelization of convolution with Sobel or Gaussian filters has been studied [70, 29]. Sobel filters are used for edge-detection [70, 43] and Gaussian filters are used for reducing the noise in a signal [29, 22]. However, a more general study of the parallel efficiency of convolution algorithms may be useful as convolution filters are not restricted to the Sobel and Gaussian variants. In the context of CNNs, a direct computation of discrete convolution is fairly straight-forward to parallelize [52]. Convolution can be reduced to a matrix-multiplication and fast Fourier transform problem, both of which can leverage efficient library packages, such as cuDNN [16] for direct convolution on GPUs and FFTW [27] for FFT on shared-memory machines (but also many other for GPUs, shared-memory, and distributed-memory architectures). The family of fast convolution algorithms from signal processing (aside from the FFT) has been largely unused for CNNs prior to the paper by Lavin and Gray [55]

. They propose a method based on Winograd’s formulation of convolution algorithms, although it is later noted to be a variant of the Toom-Cook (interpolation) method 

[5]. The Winograd-based algorithm [55] divides the convolution into three steps, each step performing a sequence of matrix multiplications. Experiments on GPUs suggest that the Winograd-based algorithm can be highly scalable for small-filter convolution problems [55]. In general, both the FFT and Winograd-based method achieve comparable execution times. When executed on GPUs, the FFT and Winograd method achieve speed-ups of up to over the direct (matrix-multiplication-based) approach for AlexNet [48]. Additionally, parallel implementations [24, 42] and specialized hardware designs [74, 14, 71] have been shown to improve the speed of convolution. We focus on the sequential arithmetic complexity and stability of fast algorithms for convolution.

The use of linear combinations in fast algorithms leverages cancellation of sums to reduce the number of element-wise multiplications, however it may also introduce significant error for certain inputs. Since the Winograd-based convolution [55], a modified Toom-Cook method, relies on Vandermonde matrix, the algorithm can quickly become inaccurate for inputs of size greater than four [55, 91]. The absolute error of the Toom-Cook method is proportional to norm of the inputs and the three matrices that compose the bilinear algorithm [5]. Better nodes and the use of the Winograd convolution algorithm with polynomials of degree greater than one have shown promising results in reducing error [5, 6]. In addition to summarizing these results, we show that decomposing a large convolution into nested smaller convolutions can result in more stable algorithms.

Given the wide array of work for convolution in signal processing, image processing, and more recently CNNs, our main contribution is to provide a comprehensive guide for various convolution algorithms and derive simple constructions of the algorithms. To do so, we leverage the general bilinear algorithm formulation, which enables derivation and analysis of fast algorithms using basic algebraic transformations of matrices and tensors.

1.2 Convolution and its variations

The convolution of two continuous functions and is defined as


Given the input vectors and (assume ), the discrete convolution between and is defined as


An explicit definition of the summation bounds for equation (2) produces different types of discrete convolution, detailed in Table 1. The linear convolution, , is equivalent to equation (2) and using bounds that keep the indices within the range of input and output vector dimensions. Cyclic convolution wraps the vectors by evaluating the indices modulo . Equivalently, cyclic convolution is the linear convolution of a periodic signal . When we only want the subset of elements from linear convolution, where every element of the filter is multiplied by an element of , we can use correlation algorithms, as introduced by Winograd  [91]. We can see these are the middle elements from a discrete convolution. Given a filter and input , correlation algorithms compute outputs.

Linear Cyclic Correlation
Table 1: Convolution variations: different formulae for the output element as well as the whole vector .

Each of the three convolution algorithms can be used to solve the other two. Using the Matrix Interchange Theorem (Theorem 2), we can derive correlation algorithms from linear convolution algorithms and vice versa. Cyclic convolution can be computed with linear convolution by appending inputs and with copies of themselves. To solve cyclic convolution with linear convolution, the inputs and are appended with zeros until they are each of size . Convolution can also be cast as a matrix–vector product, where the matrix has a Toeplitz or Hankel structure. For linear convolution, either the filter or input can be embedded as a lower-trapezoidal Toeplitz matrix. The linear convolution of the vector with an arbitrary -dimensional vector can computed with , so that

The multiplication and both compute linear convolution. Other variants may be computed using other Toeplitz matrices as shown in Table 1.

Multidimensional convolution corresponds to convolving along each mode of the inputs. Given a 2D filter and input , linear convolution is computed by


1.3 Paper overview

We survey different applications of convolution in Section 2. We formulate the convolution algorithm as a bilinear algorithm in Section 3, following Pan’s formalism for matrix-multiplication [68]. Then, we present specific implementations of fast algorithms in Section 4, Section 5, Section 6, and Section 7. We leverage our general formulation of fast convolution algorithms to quantify their cost complexity in Section 8. We derive bounds on the numerical stability of algorithms (providing a simplified summary of previous results from [5]) for the Toom-Cook method and provide solutions to reduce the error in Section 9. We provide experimental results on the stability of a variety of 1D and multidimensional convolution algorithms in Section 10. Finally, we present open questions in Section 11.

2 Problems and applications of convolution

Convolution is a key component for many scientific and engineering problems, such as signal processing, partial differential equations, and image processing. In the following section, we examine how linear discrete convolution and correlation convolution algorithms are used in a variety of fields.

2.1 Signal processing

One of the most important tasks in digital signal processing is the filtering of a long signal, represented by a sequence of real and complex numbers. The filtering of the signal is calculated by a digital filter [8], which produces a new signal called an output sequence. FIR filters, or finite-impulse-response filters, is a digital filter that captures the strength of the incoming signal for only a finite period of time [54]. The computation of the output sequence from an FIR filter can be synthesized by discrete convolution [30]. The ubiquity of FIR filters within domains such as noise removal in EKGs [13], image processing for texture mapping [38], and mobile communications [75] have consequently led to the development of highly efficient algorithms for 1D discrete convolution, such as new nesting schemes [14, 78], the Fermat number transform [2], and the Good-Thomas PFA.

2.2 Integer multiplication

Let and be two -digit integers. The value of the two integers can be rewritten as, and , where and are the individual digits of integers and respectively. A direct computation of the product can be formulated as


The similarity of equation (4) to equation (2) allows integer multiplication to be viewed as discrete convolution and vice versa.

2.3 Numerical methods for differential equations

Within physics, chemistry, and engineering, many numerical PDE solvers are based on determining the solution to continuous convolution equations. For example, integral equations for initial and boundary value problems of linear differential equations seek to describe the solution , which arises in with the integration domain described by boundary conditions, where is the Green’s function of the differential operator [51]. These problems are sometimes reduced to discrete convolution, especially when regular grids are used for discretization, often yielding 3D discrete convolution problems. Iterative methods for solutions to convolution equations leverage repeated application of the convolution operator, yielding a series of discrete convolutions. Techniques for fast convolution algorithms, such as the discrete Fourier transform, also provide a way of solving convolution equations directly. Such regular-grid-based solvers are prevalent across a variety of major numerical PDE applications in scientific computing. For example, they are used for acoustic scattering problems [11], for long-range interactions in molecular dynamics (particle-mesh Ewald method) [20], and within quantum chemistry for electronic structure calculations [33, 47, 45, 46] and dynamics [90].

2.4 Image processing

An important step in image recognition is feature extraction. Given an input image and a feature, both of which can be represented as matrices, the goal is compute a feature map for the image. A feature map assigns scores, or activations, to different areas of the original image, which can be computed via a 2D convolution. By connecting a series of 2D convolutions, a CNN can identify global features from localized changes in values, such as inferring shapes from curvature, which enables identification of objects in images 

[81, 60].

The first successful CNN was LeNet5, developed by LeCun et al. in 1994 [56]. Research in CNNs remained relatively modest until the success of AlexNet [53]

in the 2012 ImageNet competition. Since then, there has been a myriad of CNN designs 

[94, 86, 36, 41]. In many contexts, CNNs can be very large in size, making their training very expensive and their performance for inference to be contingent on the efficiency of convolution [15]. A variety of avenues have been pursued for decreasing memory footprint and runtime of CNNs, one example being the reduction in the number of parameters [15, 32, 59, 61].

Fast algorithms for 2D convolution provide a direct approach to speed-up both the training and inference of CNNs. The state of practice of 2D convolution algorithms for CNNs consists of either matrix-multiplications [16], the FFT [27], and an algorithm [55] based on Winograd’s minimal filtering method [91].

We now formally define how the series of 2D convolution are performed, leveraging the correlation form of convolution. A CNN is associated with a set of filters of size stored in the tensor . An input to a CNN will be a set of images stored in the tensor . Each filter and image has channels, such as the RGB channels for color images. The convolutions are summed over the channels and stored in ,


In equation (5), the variable is the index for which of the images we are convolving, and the variable denotes which of the filters is being used. The variables represent the index of the summed feature maps.

2.5 Cosmological data analysis

One example of large-scale data analysis via CNNs arises in cosmology. By studying the distribution of matter across various galaxies, key cosmological parameters, such as halo mass, metallicity, environment, and age can be predicted [23]. However, the study of large cosmological systems such as the Universe are hindered by the size (terabytes to petabytes) and complexity (3D and 4D) of the datasets that describe them.

While previous attempts to identify the parameters rely on hand-crafted statistical parameters [23]

, recent work has explored use of learning models such as variational autoencoders 

[49] and generative adversarial networks [31]. The use of CNNs to identify the parameters have been able to outperform the results of hand-crafted statistical measurements [64, 63] and have shown the potential to handle noise in the data [76].

Although CNNs have currently brought accurate results, designing efficient methods to process the large datasets is still an ongoing research question. Due the size and dimensionality of the data, previous works that relied on CNNs can require up to twenty days of runtime when running on the TensorFlow framework 

[63]. For applications of convolution that act of large amounts of multidimensional data, such as in image processing and cosmology, there is a need for highly efficient and scalable convolution algorithms.

3 Bilinear algorithm representation for convolution

A direct computation of an 1D linear convolution requires additions and multiplications. Faster algorithms can generally be represented using the framework of bilinear algorithms [68]. The linear convolution of 1D vectors and can be defined by a bilinear function,


A CP decomposition [50] of tensor , given by matrices , , and via


specifies a bilinear algorithm [68] for computing ,


The value is the bilinear rank of the algorithm. Matrices and specify linear combinations for inputs and respectively, which serve as respective inputs to a set of products of the two sets of linear combinations. The matrix takes linear combinations of these products to obtain each entry of the output . We refer to the multiplication by matrices and as encoding and the multiplication by matrix as decoding. When an algorithm is applied recursively many times, the bilinear rank plays a key role, since the rank determines the number of recursive calls needed. The asymptotic complexity of the recursive bilinear algorithm usually depends on and not on the particular structure of the matrices .

Given a filter of size and input of size , a direct computation of linear convolution has a rank of . For filter size and output size , the correlation algorithm has bilinear rank of when directly computed. However, algorithms with bilinear rank of exist. The optimality of this bilinear rank has been proven by Winograd [91].

Theorem 1

The minimum rank of a correlation convolution algorithm with filter of size and output of size is .

A proof of the above theorem is presented in [91]. Winograd also shows that by casting the bilinear algorithm to a trilinear algorithm, linear convolution algorithms can be derived from correlation algorithms by swapping variables. Blahut and Barabasz et al. derive a similar result by using Matrix Interchange [5, 8]. We provide an alternative proof by simply swapping the indices of the tensor .

Theorem 2 (Matrix Interchange)

Let the bilinear algorithm for linear convolution and be defined as . The correlation algorithm with output size is



From equation (6), the tensor in satisfies if and otherwise . The bilinear function computing correlation can be expressed via tensor as

with if , and consequently,

Therefore, given a bilinear algorithm to compute convolution, we obtain a bilinear algorithm for the correlation algorithm, since

The conversion between correlation and linear convolution algorithm preserves the number of element-wise multiplications, and subsequently the rank as well.

Corollary 1

The minimum rank on convolution with an input of size and filter of size is .

We now present various bilinear algorithms that achieve the minimal rank.

4 Convolution using polynomial interpolation

Given a discrete set of points with corresponding values , interpolation derives the polynomial the fits the values of as accurately as possible. Given points, a unique degree polynomial exists that satisfies for all  [37].

Recall from Section 2.2 that polynomial multiplication is equivalent to linear convolution. Let the vectors and be the coefficients for a degree polynomial and degree polynomial respectively. The linear convolution of and is equivalent to the coefficients of the polynomial product . By viewing linear convolution as polynomial multiplication, we can apply a family of fast algorithms to convolution, one of which is based on interpolation. The intuition is that we first multiply the values of and at discrete nodes. These products are equivalent to at those same points. We then interpolate on these values to compute the coefficents for polynomial . By carefully selecting the nodes and the basis for interpolation, we can derive algorithms that are both stable and compute linear convolution in asymptotically less time.

Let the matrix be the Vandermonde matrix with distinct nodes. The bilinear algorithm’s encoding matrices and are defined by keeping the first and rows of , respectively [91, 8, 10]. The decoding matrix is then given by . This construction of the matrices , , and creates a bilinear algorithm with rank .

4.1 Karatsuba’s Algorithm

In the late 1950s, Kolmogorov conjectured that integer multiplication (4) had a cost complexity of . Karatsuba refuted the conjecture by developing an algorithm running in time [44]. Karatsuba’s algorithm reuses the previous element-wise multiplications to compute the middle term of a two-digit integer multiplication problem,

With the reformulation, the multiplication now only requires three unique element-wise multiplications instead of four. When the input is larger than two, equation (4.1) can be applied by breaking the integer into two smaller integers and recursively computing each element-wise multiplication. By reducing the problem by a factor of two and making three recursive calls, the asymptotic cost of this algorithm is .

Karatsuba’s algorithm operates in three distinct steps: take linear combinations of the input, compute the element-wise multiplications, and compute the linear combinations of the products. The combination of these three steps is captured by the bilinear algorithm,


This bilinear algorithm can be viewed as an interpolation-evaluation problem using the nodes . The use of the node will be explained in the next section on Toom-Cook algorithms. Toom-Cook algorithms encompass a family of fast algorithms, such as Karatsuba’s algorithm, that operate using a more general bilinear algorithm formulation.

4.2 The Toom-Cook method

Soon after the publication of Karatsuba’s algorithm, Toom developed a generalized algorithm for any input size  [88]. Cook’s Ph.D. thesis formalized Toom’s algorithm into what is now known as the Toom-Cook method [18], which is an explicit definition of the interpolation approach from the beginning of Section 4.

The designer of the Toom-Cook method can freely choose the basis and nodes. Regardless of the basis, both the input and output must be represented in the monomial basis, since convolution is equivalent to polynomial multiplication only in this basis. The Toom-Cook algorithm can be defined by the Lagrangian basis [9, 93, 62, 8]. Using this basis, the polynomial multiplication, , is computed by the summation,


where are the set of unique nodes. Equation (11) can be rewritten as the multiplication by the inverse Vandermonde matrix,


By defining the matrices and by the truncated Vandermonde matrix and matrix by the inverse Vandermonde matrix, as explained in the beginning of Section 4, the bilinear algorithm computes the Toom-Cook algorithm. A common choice of nodes are small integer values, such as . Small integers can limit the magnitude of the scalars in the Vandermonde matrix.

As the number of nodes increases, the number of nonzeros in the Vandermonde matrix grows quadratically. The number of nonzeros in the , , and matrices can be reduced by selecting as a node [62, 10]. The node computes the product between the leading terms of inputs and . To use the node, the last row for each of the decoding matrices, and , is set to all zeros except for the last entry, which is set to . Similarly, the decoding matrix is set to , where is the original Vandermonde matrix with the last row set to all zeros, and the last entry is set to . The Karatsuba algorithm  (10) is a Toom-Cook algorithm with the nodes ,, and .

4.3 Discrete Fourier transform

The use of integer nodes creates Vandermonde matrices that are ill-conditioned, limiting the Toom-Cook method to small linear convolutions. Instead, the set of nodes can be defined by the first nonnegative powers of the primitive th primitive root of unity, . The use of the powers of as nodes generates a Vandermonde matrix that is equivalent to the discrete Fourier matrix, with . The conditioning of this matrix, which is bounded by a constant value, permits us to extend the use of the Toom-Cook method beyond sizes of four. The inverse discrete Fourier matrix is simply . The use of the discrete Fourier matrix and its inverse also defines bilinear algorithms for cyclic convolution [8].

Theorem 3 (Discrete cyclic convolution theorem)

The bilinear algorithm
computes cyclic convolution.


By expanding the bilinear algorithm, , we have the summation,

It suffices to observe that for any fixed or , the outer summation yields a zero result, since the geometric sum simplifies to

Therefore the only non-zero values in the summation are , yielding cyclic convolution.

Other transformations to compute cyclic convolution may be defined based on roots of unity in other finite fields. One example is the Fermat number transform (FNT) [2]. The FNT leverages roots of unity in the ring of integers modulo the Fermat number, for some nonnegative integer . The roots of unity can then be selected as powers of , yielding a transformation that requires only integer or bitmask additions and bit-shifts.

4.4 Fast Fourier transform

Applying the DFT using the fast Fourier transform (FFT) can reduce the complexity of this algorithm from to

. The FFT applies a divide-and-conquer structure to the DFT, which can be seen by breaking the indices into even and odd components,


Computing both terms in equation (13) recursively gives the split-radix-2 variant of the Cooley-Tukey algorithm. In general, this division can be extended to larger parities. For example, consider breaking an -length FFT into FFTs of size ,


This decomposition produces a split-radix- FFT algorithm, which uses FFTs of size followed by FFTs of size . Both approaches yield an cost.

4.5 Discrete trigonometric transform

A disadvantage of the DFT is its reliance of complex arithmetic. Complex additions require two real additions, and a direct computation of complex multiplication yields real multiplications and real addition. While this overhead does not affect the algorithm’s asymptotic cost, the additional work can reduce the efficiency of the algorithm in practice. An alternative is to apply an FFT-like transform with real values, which can be achieved with either the discrete sine or cosine transform.

Instead of using both the sine and cosine values, which necessitates complex values, the use of only the cosine leads to the discrete cosine transform (DCT), which can be used for deriving cyclic convolution algorithms [7]. This construct has the following bilinear algorithm,


where is the matrix derived from the DCT. Similar to the fast Fourier transform, the DCT also has a fast transform with a cost complexity of . However, while DCT avoids complex arithmetic, it requires computation of cosine functions. The cosine transform is orthogonal, so the DCT algorithm, like the FT approach, is stable [84].

5 Convolution using modular polynomial arithmetic

Winograd presents a more general family of convolution algorithms [91] based on modular arithmetic over polynomials. Consider evaluating the remainder of the product , expressed as . When , where we denote the degree of a polynomial by , then . If instead , then , as the remainder of will produce a polynomial of degree at most . Winograd shows that computing remainders of (evaluating and ) with well–chosen polynomial divisors will produce new fast and stable linear convolution algorithms. We first present Winograd’s algorithm for recovering with .

5.1 Winograd’s convolution method

In interpolation, each polynomial is evaluated at a set of discrete points. In Winograd’s convolution algorithm, the remainder of the product is computed using distinct polynomial divisors, . The polynomial divisors , must be coprime, or share no common roots. Together, the product of the polynomials define the larger polynomial divisor, . After computing the remainders with each the polynomial divisors, , the remainder is recovered via the Chinese remainder theorem.

The Chinese remainder theorem for polynomials provides a specification for recovering from the remainders of the polynomial remainders,


so long as . The bound on degree, in combination with the fact that are coprime, ensures that the remainder polynomials uniquely specify . Consequently, defining , Bézout’s identity implies that there exists polynomials and such that


A set of such polynomials can be computed by the extended Euclidean algorithm. The desired polynomial satisfying the set of equivalences (16) can be recovered as


since for , while

Interpolation is a particular instance of a Winograd’s convolution algorithm. By selecting the polynomial divisors to be the polynomial , where are nodes, Winograd’s algorithm is equivalent to the Toom-Cook method using Lagrangian interpolation [8]. The DFT algorithm for linear convolution may be obtained by the polynomial with , whose roots are equally spaced on the unit circle on the complex plane [62]. With the choice for , we obtain cyclic convolution [67], since the remainder polynomial has the right coefficients as

Polynomial divisors can also be chosen to be of degree (superlinear polynomials) [6]. Different degree choices for the polynomial divisors will yield trade-offs between the bilinear rank and the number of additions necessary. A few examples of this trade-off are shown in Table 2 [8, Table 5.2]. The degree choices also affect numerical stability.

Rank Adds
2 2 3 3
2 2 4 7
3 3 5 20
3 3 6 10
3 3 9 4
4 4 7 41
4 4 9 15
Table 2: Number of additions for Winograd’s convolution algorithm with different bilinear ranks

5.2 Bilinear algorithm for Winograd’s convolution method

We now present a formulation of the bilinear algorithm for Winograd’s convolution algorithm. As before, we denote the coefficients of an arbitrary polynomial as . Let be a matrix that can act on the coefficients of any degree polynomial to compute the coefficients of as . Tolimieri provides examples of this matrix [62], which he refers to as the matrix . We provide a succinct algebraic construction of this linear operator,



is an identity matrix of size

, contains the top rows of , and contains the bottom rows of .

Theorem 4

Let , with , then


Let , so that . As , then . Defining , let

where , so . Then we have that . Furthermore, observing that and separating , where is lower-triangular and is upper-triangular, we have

Further, since , we have that , and so . Therefore, we obtain

Using this linear operator, we can now construct an operator for modular polynomial multiplication. Since,

we have that

Further, given a bilinear algorithm to compute linear convolution of two -dimensional vectors, we can obtain an algorithm to compute ,

To implement the Winograd’s convolution algorithm, we need to compute for to obtain the coefficients of in (16). After obtaining these remainders , it suffices to compute (18) by multiplying each with the matrix,

where . Consequently, we can interpret Winograd’s convolution algorithm as a prescription for building a new bilinear algorithm for convolution from a set of bilinear algorithms that compute the linear convolution between two sequences of vectors with dimension .

Definition 1 (Winograd’s Convolution Algorithm)

Given where and are coprime, as well as for , where is a bilinear algorithm for linear convolution of vectors of dimension , Winograd’s convolution algorithm yields a bilinear algorithm for computing linear convolution with vectors of dimension and , where

with and polynomial .

To automatically generate Winograd’s convolution algorithm, it suffices to have a prescription to obtain . Below, we give, to the best our knowledge, the first approach that computes the coefficients of using only linear algebra.

Definition 2

Given coprime polynomials and , the coefficients of polynomials and satisfying are



The polynomials degrees of and are at most and  [4]. Therefore, we can rewrite the equivalence as


To show that the matrix is invertible, we demonstrate that there cannot exist a vector , , such that . Equivalently, we show there cannot exists vectors and such that . Since and are coprime, must be a multiple of . However, because , there cannot exist such a polynomial .

6 Other fast algorithms for convolution

We now discuss two other techniques for fast convolution, which are not based on polynomial algebra.

6.1 Fast symmetric multiplication

Recall that convolution can be solved by a Toeplitz matrix–vector product, . Consider, for simplicity, the scenario when is the dimension of both and . This problem can be converted to a Hankel matrix-vector product by reversing the order of the elements in the vector with , where

We can embed (for simplicity) within a square Hankel matrix, , by appending zero columns to (using to define each anti-diagonal of the matrix), so that . Now, we can observe that this square Hankel matrix is symmetric, and further that this type of matrix can be subdivided recursively into Hankel matrices,

Consequently, we can leverage fast nested bilinear algorithms to compute the product of a symmetric matrix and a vector [82]. These algorithms compute the multiplication of an symmetric matrix with a vector using multiplications. The choice of , requires 3 multiplications, and yields the fastest asymptotic complexity (same as Karatsuba’s algorithm ). This variant of the algorithm performs the Hankel matrix–vector product using the transformation,

The new form can be computed with Hankel–vector products of half the dimension. The addition of the Hankel submatrices can be computed with additions. Therefore, the cost of the fast symmetric algorithm is by directly computing the convolution once .

6.2 Minimizing scalar products

There remain other bilinear algorithms for convolution not covered by the techniques in the previous sections. For example, a bilinear algorithm for linear convolution of -dimensional vectors can be derived by the factorization [8],