    # Type-II/III DCT/DST algorithms with reduced number of arithmetic operations

We present algorithms for the discrete cosine transform (DCT) and discrete sine transform (DST), of types II and III, that achieve a lower count of real multiplications and additions than previously published algorithms, without sacrificing numerical accuracy. Asymptotically, the operation count is reduced from 2N log_2 N to (17/9) N log_2 N for a power-of-two transform size N. Furthermore, we show that a further N multiplications may be saved by a certain rescaling of the inputs or outputs, generalizing a well-known technique for N=8 by Arai et al. These results are derived by considering the DCT to be a special case of a DFT of length 4N, with certain symmetries, and then pruning redundant operations from a recent improved fast Fourier transform algorithm (based on a recursive rescaling of the conjugate-pair split radix algorithm). The improved algorithms for DCT-III, DST-II, and DST-III follow immediately from the improved count for the DCT-II.


## I Introduction

In this paper, we describe recursive algorithms for the type-II and type-III discrete cosine and sine transforms (DCT-II and DCT-III, and also DST-II and DST-III), of power-of-two sizes, that require fewer total real additions and multiplications than previously published algorithms (with an asymptotic reduction of about 5.6%), without sacrificing numerical accuracy. Our DCT and DST algorithms are based on a recently published fast Fourier transform (FFT) algorithm, which reduced the operation count for the discrete Fourier transform (DFT) compared to the previous-best split-radix algorithm. The gains in this new FFT algorithm, and consequently in the new DCTs and DSTs, stem from a recursive rescaling of the internal multiplicative factors within an algorithm called a “conjugate-pair” split-radix FFT [2, 3, 4, 5] so as to simplify some of the multiplications. In order to derive a DCT algorithm from this FFT, we simply consider the DCT-II to be a special case of a DFT with real input of a certain even symmetry, and discard the redundant operations [6, 7, 8, 9, 10, 1]. The DCT-III, DST-II, and DST-III have identical operation counts to the DCT-II of the same size, since the algorithms are related by simple transpositions, permutations, and sign flips [11, 12, 13].

Since 1968, the lowest total count of real additions and multiplications, herein called “flops” (floating-point operations), for the DFT of a power-of-two size N was achieved by the split-radix algorithm, with 4N log_2 N − 6N + 8 flops for N > 1 [14, 15, 16, 6, 8]. This count was recently surpassed by new algorithms achieving a flop count of (34/9) N log_2 N + O(N) [1, 17]. Similarly, the lowest-known flop count for the DCT-II of size N was previously 2N log_2 N − N + 2 for a unitary normalization (with the additive constant depending on normalization) [18, 6, 19, 7, 20, 21, 12, 22, 13, 23, 24, 25, 26], and could be achieved by starting with the split-radix FFT and discarding redundant operations [6, 7]. (Many DCT algorithms with an unreported or larger flop count have also been described [27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40].) Based on our new FFT algorithm, the flop counts for the various DCT types were reduced using a code generator [41, 10] that automatically pruned the redundant operations from an FFT with a given symmetry, but neither an explicit algorithm nor a general formula for the flop count were presented except for the DCT-I. In this paper, we use the same starting point to “manually” derive a DCT-II algorithm by pruning redundant operations from a real-even FFT, and give the general formula for the new flop count (for N > 1):

 \frac{17}{9} N \log_2 N - \frac{17}{27} N - \frac{1}{9} (-1)^{\log_2 N} \log_2 N + \frac{7}{54} (-1)^{\log_2 N} + \frac{3}{2}. (1)

The first savings over the previous record occur for N = 16, and are summarized in Table I for several N.

We also consider the effect of normalization on this flop count: the above count was for a unitary transform, but slightly different counts are obtained by other choices. Moreover, we show that a further N multiplications can be saved by individually rescaling every output of the DCT-II (or input of the DCT-III). In doing so, we generalize a result by Arai et al., who showed that eight multiplications could be saved by scaling the outputs of a DCT-II of size N = 8, a result commonly applied to JPEG compression.

If we merely wished to show that the DCT-II/III could be computed in (17/9) N log_2 N + O(N) operations, we could do so by using well-known techniques to re-express the DCT in terms of a real-input DFT of length N plus O(N) pre/post-processing operations [29, 30, 33, 44, 45, 46], and then substituting our new FFT, which requires (17/9) N log_2 N + O(N) operations for real inputs. However, with FFT and DCT algorithms, there is great interest in obtaining not only the best possible asymptotic constant factor, but also the best possible exact count of arithmetic operations (which, for the DCT-II of power-of-two sizes, had not changed by even one operation for over 20 years). Our result (1) is intended as a new upper bound on this (still unknown) minimum exact count, and therefore we have done our best with the O(N) terms as well as the asymptotic constant. It turns out, in fact, that our algorithm to achieve (1) is closely related to well-known algorithms for expressing the DCT-II in terms of a real-input FFT of the same length, but it does not seem obvious a priori that this is what one obtains by pruning our FFT for symmetric data.

In the following sections, we first review how a DCT-II may be expressed as a special case of a DFT, and how the previous optimum flop count can be achieved by pruning redundant operations from a conjugate-pair split-radix FFT. Then, we briefly review the new FFT algorithm of our previous work, and derive the new DCT-II algorithm. We follow by considering the effect of normalization and scaling. Finally, we consider the extension of this algorithm to algorithms for the DCT-III, DST-II, and DST-III. We close with some concluding remarks about future directions. We emphasize that no proven tight lower bound on the DCT-II flop count is currently known, and we make no claim that eq. (1) is the lowest possible (although we have endeavored not to miss any obvious optimizations).

## II DCT-II in terms of DFT

Various forms of discrete cosine transform have been defined, corresponding to different boundary conditions on the transform. Perhaps the most common form is the type-II DCT, used in image compression and many other applications. The DCT-II is typically defined as a real, orthogonal (unitary), linear transformation by the formula (for k = 0, …, N − 1):

 C^{II}_k = \sqrt{\frac{2-\delta_{k,0}}{N}} \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}\left(n+\frac{1}{2}\right)k\right], (2)

for inputs x_n and outputs C^{II}_k, where \delta_{k,0} is the Kronecker delta (1 for k = 0 and 0 otherwise). However, we wish to emphasize in this paper that the DCT-II (and, indeed, all types of DCT) can be viewed as special cases of the discrete Fourier transform (DFT) with real inputs of a certain symmetry, and where only a subset of the outputs need be computed. This (well known) viewpoint is fruitful because it means that any FFT algorithm for the DFT leads immediately to a corresponding fast algorithm for the DCT-II simply by discarding the redundant operations [6, 7, 8, 9, 10, 1].

The discrete Fourier transform of size N is defined by

 X_k = \sum_{n=0}^{N-1} x_n \omega_N^{nk}, (3)

where \omega_N = e^{-2\pi i/N} is a primitive Nth root of unity. In order to relate this to the DCT-II, it is convenient to choose a different normalization for the latter transform:

 C_k = 2 \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}\left(n+\frac{1}{2}\right)k\right]. (4)

This normalization is not unitary, but it is more directly related to the DFT and therefore more convenient for the development of algorithms. Of course, any fast algorithm for C_k trivially yields a fast algorithm for C^{II}_k, although the exact count of required multiplications depends on the normalization, an issue we discuss in more detail in section V.
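As a concrete reference point, both normalizations can be written down directly from eqs. (2) and (4). The sketch below is a naive O(N^2) evaluation, useful only for checking fast algorithms; the function names are ours, not from the original work:

```python
import numpy as np

def dct2_naive(x):
    """Unnormalized DCT-II of eq. (4): C_k = 2 sum_n x_n cos[(pi/N)(n+1/2)k]."""
    N = len(x)
    n = np.arange(N)
    return np.array([2 * np.sum(x * np.cos(np.pi / N * (n + 0.5) * k))
                     for k in range(N)])

def dct2_unitary(x):
    """Unitary DCT-II of eq. (2): rescale each C_k by sqrt((2 - delta_{k,0})/N)/2."""
    N = len(x)
    delta = (np.arange(N) == 0).astype(float)  # Kronecker delta_{k,0}
    return np.sqrt((2 - delta) / N) / 2 * dct2_naive(x)
```

Since the unitary version is orthogonal, its transform matrix times its own transpose is the identity, which gives a quick sanity check of the scale factors.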

In order to derive C_k from the DFT formula, one can use the identity 2\cos\theta = e^{i\theta} + e^{-i\theta} to write:

 C_k = \tilde{X}_k = \sum_{n=0}^{4N-1} \tilde{x}_n \omega_{4N}^{nk}, (5)

where \tilde{x}_n is a real-even sequence of length \tilde{N} = 4N (i.e. \tilde{x}_n = \tilde{x}_{4N-n}), defined as follows for 0 \le n < N:

 \tilde{x}_{2n} = \tilde{x}_{4N-2n-2} = 0, (6)
 \tilde{x}_{2n+1} = \tilde{x}_{4N-(2n+1)} = x_n. (7)

Thus, the DCT-II of size N is precisely the first N outputs of a DFT of size 4N, of real-even inputs, where the even-indexed inputs are zero.

Figure 1: A DCT-II of size N=8 (open dots, x0,…,x7) is equivalent to a size-4N DFT via interleaving with zeros (black dots) and extending in an even (squares) periodic (gray) sequence.

This is illustrated by an example, for N = 8, in figure 1. The eight inputs of the DCT are shown as open dots, which are interleaved with zeros (black dots) and extended to an even (squares) periodic (gray dots) sequence of length 4N corresponding to the DFT. (The type-II DCT is distinguished from the other DCT types by the fact that it is even about both the left and right boundaries of the original data, and the symmetry points fall halfway in between pairs of the original data points.) We will refer, below, to this figure in order to illustrate what happens when an FFT algorithm is applied to this real-symmetric zero-interleaved data.
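The embedding of eqs. (6)–(7) is easy to verify numerically: build the zero-interleaved even extension and compare the first N outputs of a size-4N DFT against the direct cosine sum of eq. (4). The sketch below assumes numpy's FFT sign convention \omega = e^{-2\pi i/N}, consistent with eq. (3):

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(N)

# Real-even, zero-interleaved sequence of length 4N per eqs. (6)-(7):
xt = np.zeros(4 * N)
n = np.arange(N)
xt[2 * n + 1] = x             # odd slots hold the data
xt[4 * N - (2 * n + 1)] = x   # even symmetry: x~_m = x~_{4N-m}

# Size-4N DFT of the extended sequence (numpy uses omega = exp(-2*pi*i/N)):
Xt = np.fft.fft(xt)

# Direct evaluation of the unnormalized DCT-II of eq. (4):
C = np.array([2 * np.sum(x * np.cos(np.pi / N * (n + 0.5) * k))
              for k in range(N)])
```

The first N outputs of `Xt` are purely real (the input is real-even) and equal `C` to rounding error, confirming eq. (5).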

## III Conjugate-pair FFT and DCT-II

Although the previous minimum flop count for DCT-II algorithms can be derived from the ordinary split-radix FFT algorithm [6, 7] (and it can also be derived in other ways [18, 19, 20, 21, 12, 22, 13, 23, 24, 25, 26]), here we will do the same thing using a variant dubbed the “conjugate-pair” split-radix FFT. This algorithm was originally proposed to reduce the number of flops, but was later shown to have the same flop count as ordinary split-radix [3, 4, 5]. It turns out, however, that the conjugate-pair algorithm exposes symmetries in the multiplicative factors that can be exploited to reduce the flop count by an appropriate rescaling, which we will employ in the following sections.

Starting with the DFT of equation (3), the decimation-in-time conjugate-pair FFT splits it into three smaller DFTs: one of size N/2 of the even-indexed inputs, and two of size N/4:

 X_k = \sum_{n_2=0}^{N/2-1} \omega_{N/2}^{n_2 k} x_{2n_2} + \omega_N^{k} \sum_{n_4=0}^{N/4-1} \omega_{N/4}^{n_4 k} x_{4n_4+1} + \omega_N^{-k} \sum_{n_4=0}^{N/4-1} \omega_{N/4}^{n_4 k} x_{4n_4-1}, (8)

where the indices 4n_4 - 1 are computed modulo N. [In contrast, the ordinary split-radix FFT uses x_{4n_4+3} for the third sum (a cyclic shift of x_{4n_4-1}), with a corresponding multiplicative “twiddle” factor of \omega_N^{3k}.] This decomposition is repeated recursively (until base cases of size N = 1 or N = 2, not shown, are reached), as shown by the pseudo-code in Algorithm 1.

Here, we denote the results of the three sub-transforms of size N/2, N/4, and N/4 by U_k, Z_k, and Z'_k, respectively. The number of flops required by this algorithm, after certain simplifications (common subexpression elimination and constant folding, and special-case simplifications of the constants for k = 0 and k = N/8) and not counting data-independent operations like the computation of \omega_N^k, is 4N \log_2 N - 6N + 8, identical to ordinary split radix.
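The recursion of eq. (8) can be sketched directly. The following is an illustrative, unoptimized implementation (none of the flop-saving simplifications are applied, and the function name is ours); it only demonstrates that the three-way split with conjugate twiddle factors reproduces the DFT:

```python
import numpy as np

def conj_pair_fft(x):
    """Conjugate-pair split-radix FFT of eq. (8); len(x) must be a power of two.
    Illustrative sketch only -- no flop-level optimizations."""
    N = len(x)
    x = np.asarray(x, dtype=complex)
    if N == 1:
        return x.copy()
    if N == 2:
        return np.array([x[0] + x[1], x[0] - x[1]])
    U = conj_pair_fft(x[::2])                            # size N/2: even-indexed inputs
    Z = conj_pair_fft(x[1::4])                           # size N/4: x_{4n+1}
    Zp = conj_pair_fft(x[(4 * np.arange(N // 4) - 1) % N])  # size N/4: x_{4n-1} mod N
    X = np.empty(N, dtype=complex)
    for k in range(N):
        w = np.exp(-2j * np.pi * k / N)                  # twiddle omega_N^k
        X[k] = U[k % (N // 2)] + w * Z[k % (N // 4)] + np.conj(w) * Zp[k % (N // 4)]
    return X
```

Note the third sub-transform's twiddle factor is simply the complex conjugate of the second's, which is the structural feature the rescaled algorithm of section IV exploits.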

In the following sections, we will have to exploit further simplifications for the case where the inputs are real. In this case, X_{N-k} = X_k^* (where * denotes complex conjugation) and one can save slightly more than half of the flops, both in the ordinary split-radix [7, 48] and in the conjugate-pair split-radix, by eliminating redundant operations, to achieve a flop count of 2N \log_2 N - 4N + 6. Specifically, one need only compute the outputs X_k for 0 \le k \le N/2 (where the k = 0 and k = N/2 outputs are purely real), and the corresponding algorithm is shown in Algorithm 2.

Again, for simplicity we do not show special-case optimizations for k = 0 and k = N/8, so it may appear that the k = 0, N/8, and N/4 outputs are computed twice. To derive Algorithm 2, one exploits two facts. First, the sub-transforms operate on real data and therefore have conjugate-symmetric output. Second, the twiddle factors in Algorithm 1 have some redundancy because of the identity \omega_N^{N/4-k} = -i (\omega_N^{k})^*, which allows us to share twiddle factors between the k and N/4 - k iterations of the original loop.

### III-B Fast DCT-II via split-radix FFT

Before we derive our fast DCT-II with a reduced flop count, we first derive a DCT-II algorithm with the same flop count as previously published algorithms, but starting with the conjugate-pair split-radix algorithm. This algorithm will then be modified in section IV-B, below, to reduce the number of multiplications.

In eq. (5), we showed that a DCT-II of size N, with inputs x_n and outputs C_k, can be expressed as the first N outputs of a DFT of size \tilde{N} = 4N of real-even inputs \tilde{x}_n. [Therefore, N in eq. (8) above and in the surrounding text is replaced below by \tilde{N} = 4N.] Now, consider what happens when we evaluate this DFT via the conjugate-pair split decomposition in eq. (8) and Algorithm 2. In this algorithm, we compute three smaller DFTs. First, U_k is the DFT of size 2N of the even-indexed inputs \tilde{x}_{2n}, but these are all zero and so U_k is zero. Second, Z_k is the DFT of size N of the inputs \tilde{x}_{4n+1} (where 4n + 1 is evaluated modulo 4N), which by eq. (7) correspond to the original data by the formula:

 \tilde{y}_n = \tilde{x}_{4n+1} = \begin{cases} x_{2n} & 0 \le n < N/2 \\ x_{2N-2n-1} & N/2 \le n < N \end{cases} (9)

where we have denoted them by \tilde{y}_n (n = 0, …, N − 1) for convenience. That is, Z_k is the real-input DFT of size N of the even-indexed elements of x_n followed by the odd-indexed elements of x_n in reverse order. For example, this is shown for N = 8 in figure 2, with the circled points corresponding to the \tilde{y}_n, which are clearly the even-indexed x_n followed by the odd-indexed x_n in reverse.

Figure 2: The DCT-II of N=8 points xn (open dots) is computed, in a conjugate-pair FFT of the ~N=32 extended data ~xn (squares) from figure 1, via the DFT Zk of the circled points ~x4n+1, corresponding to the even-indexed xn followed by the odd-indexed xn.

The third DFT, Z'_k, of size N is redundant: it is the DFT of \tilde{x}_{4n-1}, but by the even symmetry of \tilde{x} this is equal to \tilde{x}_{4(N-n)+1} = \tilde{y}_{N-n}, and therefore Z'_k = Z_k^* (the complex conjugate of Z_k).

Therefore, a fast DCT-II is obtained merely by computing a single real-input DFT of size N to obtain Z_k, and then combining this according to Algorithm 2 to obtain C_k = \tilde{X}_k for k = 0, …, N − 1. In particular, in the loop of Algorithm 2, all but the \tilde{X}_k and \tilde{X}_{N-k} terms correspond to subscripts greater than N and are not needed. Moreover, we obtain

 \tilde{X}_k = \omega_{\tilde{N}}^{k} Z_k + \omega_{\tilde{N}}^{-k} Z_k^* = 2\,\Re(\omega_{4N}^{k} Z_k),
 \tilde{X}_{N-k} = -i\left(\omega_{\tilde{N}}^{k} Z_k - \omega_{\tilde{N}}^{-k} Z_k^*\right)^* = -2\,\Im(\omega_{4N}^{k} Z_k).

The resulting algorithm, including the special-case optimizations for k = 0 (where \omega_{4N}^0 = 1 and Z_0 is real) and k = N/2 (where \omega_{4N}^{N/2} = e^{-i\pi/4} and Z_{N/2} is real), is shown in Algorithm 3.

In fact, Algorithm 3 is equivalent to an algorithm derived in a different way by various authors [30, 33] to express a DCT-II in terms of a real-input DFT of the same length. Here, we see that this algorithm is exactly equivalent to a conjugate-pair split-radix FFT of length 4N. Previous authors obtained a suboptimal flop count for the DCT-II using this algorithm only because they used a suboptimal real-input DFT for the Z_k subtransform (split-radix being then almost unknown). Using a real-input split-radix DFT to compute Z_k, the flop count for Z_k is 2N \log_2 N - 4N + 6 from above. To get the total flop count for Algorithm 3, we need to add N/2 - 1 general complex multiplications by \omega_{4N}^k (6 flops each) plus two real multiplications (2 flops), for a total of 2N \log_2 N - N + 2 flops. This matches the best-known flop count in the literature (where the +2 can be removed merely by choosing a different normalization as discussed in section V).
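The structure of Algorithm 3 (permute, one same-length real DFT, then the 2Re/−2Im recombination above) can be sketched as follows. This is an illustrative version only, using numpy's FFT in place of a real-input split-radix subtransform and without the flop-level optimizations; the function name is ours:

```python
import numpy as np

def dct2_via_rdft(x):
    """DCT-II in the eq. (4) normalization via one real-input DFT of the same
    (even) length N, following the structure of Algorithm 3 (sketch only)."""
    N = len(x)
    # y~_n of eq. (9): even-indexed elements, then odd-indexed in reverse order.
    y = np.concatenate((x[0::2], x[1::2][::-1]))
    Z = np.fft.fft(y)                       # Z_k: real-input DFT of size N
    C = np.empty(N)
    C[0] = 2 * Z[0].real                    # k = 0: omega^0 = 1 and Z_0 real
    for k in range(1, N // 2):
        t = np.exp(-2j * np.pi * k / (4 * N)) * Z[k]   # omega_{4N}^k Z_k
        C[k] = 2 * t.real                   # C_k     =  2 Re(omega_{4N}^k Z_k)
        C[N - k] = -2 * t.imag              # C_{N-k} = -2 Im(omega_{4N}^k Z_k)
    C[N // 2] = np.sqrt(2) * Z[N // 2].real  # k = N/2: omega_{4N}^{N/2} = e^{-i pi/4}
    return C
```

The loop performs N/2 − 1 twiddle multiplications plus the two real multiplications in the special cases, matching the operation-count bookkeeping in the text.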

## IV New FFT and DCT-II

In this section, we first review the new FFT algorithm introduced in our previous work, based on a recursive rescaling of the conjugate-pair FFT, and then apply it to derive a fast DCT-II algorithm as in the previous section.

### IV-A New FFT

Based on the conjugate-pair split-radix FFT from section III, a new FFT algorithm with a reduced number of flops can be derived by rescaling the subtransforms. We will not reproduce the derivation here, but will simply summarize the results. In particular, the original conjugate-pair split-radix Algorithm 1 is split into four mutually recursive subroutines, each of which has the same split-radix structure but computes a DFT scaled by a different factor. These algorithms are shown in Algorithm 4 for the case of complex inputs, and in Algorithm 5 specialized for real inputs; in both, the scaling factors are combined with the twiddle factors to reduce the total number of multiplications compared to Algorithms 1 and 2, respectively.

Again, special cases for k = 0 and k = N/8, as well as the base cases and obvious simplifications such as common-subexpression elimination, are not shown for simplicity.

The key aspect of these algorithms is the scale factor s_{N,k}, where the scaled subtransforms compute the DFT outputs divided by s_{\ell N,k} for \ell = 1, 2, 4. (These scaled subroutines are condensed into one in Algorithms 4–5, by making \ell a variable, but in practice they need to be implemented separately in order to exploit the special cases of the scale factor. The unscaled case is written separately because it is factorized differently.) This scale factor is defined for N = 2^m by the following recurrence, where k_4 = k mod N/4:

 s_{N=2^m,k} = \begin{cases} 1 & N \le 4, \\ s_{N/4,k_4} \cos(2\pi k_4/N) & k_4 \le N/8, \\ s_{N/4,k_4} \sin(2\pi k_4/N) & \text{otherwise}. \end{cases} (10)

This definition has the properties: s_{N,0} = 1, s_{N,k} is periodic in k with period N/4, and s_{N,-k} = s_{N,k}. Also, 0 < s_{N,k} \le 1 and s_{N,k} decays rather slowly with N: it has an asymptotic lower bound proportional to N^{-1/4}, since each of the \log_4 N levels of the recurrence multiplies it by a factor of at least \cos(\pi/4) = 1/\sqrt{2}. When these scale factors are combined with the twiddle factors \omega_N^k, we obtain terms of the form

 t_{N,k} = \omega_N^{k} \, \frac{s_{N/4,k}}{s_{N,k}}, (11)

which is always a complex number of the form 1 - i\tan(2\pi k_4/N) or \cot(2\pi k_4/N) - i and can therefore be multiplied with two fewer real multiplications than are required to multiply by a general \omega_N^k. Because of the symmetry s_{N,N-k} = s_{N,k}, it is possible to specialize Algorithm 4 for real data in the same way as in Algorithm 2, because the scale factor preserves the conjugate symmetry of the outputs of all subtransforms. (In Algorithm 5, we have also utilized various symmetries of the scale factors, such as s_{N,-k} = s_{N,k}, as described in our previous work, to make explicit the shared multiplicative factors between the different terms.)
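The recurrence (10) and the factor (11) are easy to check numerically. A minimal sketch (function names ours; the recursion follows eq. (10) literally) that confirms t_{N,k} always has a unit real or imaginary part for the indices 0 ≤ k < N/4 used in the recursion:

```python
import numpy as np

def s(N, k):
    """Scale factor s_{N,k} of eq. (10); N a power of two."""
    if N <= 4:
        return 1.0
    k4 = k % (N // 4)
    if k4 <= N // 8:
        return s(N // 4, k4) * np.cos(2 * np.pi * k4 / N)
    return s(N // 4, k4) * np.sin(2 * np.pi * k4 / N)

def t(N, k):
    """Combined factor t_{N,k} = omega_N^k * s_{N/4,k} / s_{N,k} of eq. (11)."""
    return np.exp(-2j * np.pi * k / N) * s(N // 4, k) / s(N, k)
```

For k ≤ N/8 this evaluates to 1 − i tan(2πk/N), and for N/8 < k < N/4 to cot(2πk/N) − i, so one of the two components is ±1 and a complex multiplication by t_{N,k} needs only two real multiplications.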

The resulting flop count, for arbitrary complex data x_n, is then reduced from 4N \log_2 N - 6N + 8 to

 \frac{34}{9} N \log_2 N - \frac{124}{27} N - 2\log_2 N - \frac{2}{9}(-1)^{\log_2 N}\log_2 N + \frac{16}{27}(-1)^{\log_2 N} + 8. (12)

In particular, if complex multiplications are implemented as four real multiplications and two real additions, then the savings are purely in the number of real multiplications. The number of real multiplications saved over ordinary split radix is given by:

 M(N) = \frac{2}{9} N \log_2 N - \frac{38}{27} N + 2\log_2 N + \frac{2}{9}(-1)^{\log_2 N}\log_2 N - \frac{16}{27}(-1)^{\log_2 N}. (13)

If the data are purely real, it was shown that M(N)/2 multiplications are saved over the corresponding real-input split-radix algorithm. These flop counts are to compute the original, unscaled definition of the DFT. If one is allowed to scale the outputs by any factor desired, then scaling the outputs X_k by 1/s_{N,k} [corresponding to the \ell = 1 subroutine in Algorithm 4] saves an additional M_S(N) multiplications for complex inputs, where:

 M_S(N) = \frac{2}{9} N \log_2 N - \frac{20}{27} N + \frac{2}{9}(-1)^{\log_2 N}\log_2 N - \frac{7}{27}(-1)^{\log_2 N} + 1. (14)

For real inputs, one similarly saves M_S(N)/2 flops by operating on real data in Algorithm 5.
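The counts in eqs. (12)–(14) are simple to tabulate and cross-check. A sketch (function names ours) that verifies the defining identity split-radix − new = M(N):

```python
from math import log2

def flops_split_radix(N):
    """Classic split-radix complex-DFT flop count, N = 2^m > 1."""
    return 4 * N * log2(N) - 6 * N + 8

def flops_new_fft(N):
    """New complex-FFT flop count of eq. (12)."""
    m = log2(N); sgn = (-1) ** int(m)
    return 34/9 * N * m - 124/27 * N - 2 * m - 2/9 * sgn * m + 16/27 * sgn + 8

def M(N):
    """Eq. (13): real multiplications saved over ordinary split radix."""
    m = log2(N); sgn = (-1) ** int(m)
    return 2/9 * N * m - 38/27 * N + 2 * m + 2/9 * sgn * m - 16/27 * sgn

def MS(N):
    """Eq. (14): additional multiplications saved by scaling the outputs."""
    m = log2(N); sgn = (-1) ** int(m)
    return 2/9 * N * m - 20/27 * N + 2/9 * sgn * m - 7/27 * sgn + 1
```

Evaluating these shows, for instance, that M(16) = 0 while M(64) = 8: the complex-data savings first appear at N = 64.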

Although the division by a cosine or sine in the scale factor may, at first glance, seem as if it may exacerbate floating-point errors, this is not the case. Cooley–Tukey based FFT algorithms have O(\sqrt{\log N}) root-mean-square error growth, and O(\log N) worst-case error bounds, in finite-precision floating-point arithmetic (assuming that the trigonometric constants are precomputed accurately) [49, 50, 51], and the new FFT is no different. The reason for this is straightforward: it never adds or subtracts scaled and unscaled values. Instead, wherever the original FFT would compute a given intermediate value, the new FFT computes that value divided by some fixed scale factor, so sums and differences always combine consistently scaled quantities.

### IV-B Fast DCT-II from new FFT

To derive the new DCT-II algorithm based on the new FFT of the previous section, we follow exactly the same process as in Sec. III-B. We express the DCT-II of length N in terms of a DFT of length 4N, apply the FFT of Algorithm 5, and discard the redundant operations. As before, the U_k subtransform is identically zero, Z_k is the transform of the even-indexed elements of x_n followed by the odd-indexed elements in reverse order, and Z'_k = Z_k^* is redundant.

The resulting algorithm is shown in Algorithm 6, and differs from the original fast DCT-II of Algorithm 3 in only two ways. First, instead of calling the ordinary split-radix (or conjugate-pair) FFT for the Z_k subtransform, it calls the scaled real-input transform of Algorithm 5. Second, because this subtransform is scaled by 1/s_{N,k}, the twiddle factor \omega_{4N}^k in Algorithm 6 is multiplied by s_{N,k}.

To derive the flop count for Algorithm 6, we merely need to add the flop count for the Z_k subtransform [which saves M_S(N)/2 flops, from eq. (14), compared to ordinary real-input split radix] to the number of flops in the loop, where the latter is exactly the same as for Algorithm 3 because the products \omega_{4N}^k s_{N,k} can be precomputed. Therefore, we save a total of M_S(N)/2 flops compared to the previous best flop count of 2N \log_2 N - N + 2, resulting in the flop count of eq. (1). This formula matches the flop count that was achieved by an automatic code generator in our previous work.
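This bookkeeping can also be checked symbolically with a short script (function names ours): the new count of eq. (1) equals the previous count minus M_S(N)/2 for every power-of-two size.

```python
from math import log2

def dct2_new(N):
    """New DCT-II flop count of eq. (1)."""
    m = log2(N); sgn = (-1) ** int(m)
    return 17/9 * N * m - 17/27 * N - 1/9 * sgn * m + 7/54 * sgn + 1.5

def dct2_prev(N):
    """Previous best DCT-II flop count, 2 N log2 N - N + 2."""
    return 2 * N * log2(N) - N + 2

def MS(N):
    """Eq. (14): multiplications saved by the output rescaling."""
    m = log2(N); sgn = (-1) ** int(m)
    return 2/9 * N * m - 20/27 * N + 2/9 * sgn * m - 7/27 * sgn + 1
```

For example, both formulas give 42 flops at N = 8, while at N = 16 the new count is 112 versus the previous 114: the first savings mentioned in the introduction.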

Because the new DCT algorithm is mathematically merely a special set of inputs for the underlying FFT, it will have the same favorable logarithmic error bounds as discussed in the previous section.

## V Normalizations

In the above definition of the DCT-II, we use a scale factor “2” in front of the summation in order to make the DCT-II directly equivalent to the corresponding DFT. But in some circumstances, it is useful to multiply by other factors, and different normalizations lead to slightly different flop counts. For example, one could use the unitary normalization from eq. (2), which replaces 2 by \sqrt{2/N} or (for k = 0) by \sqrt{1/N}, and requires the same number of flops. If one uses the unitary normalization multiplied by \sqrt{N}, then one saves two multiplications in the k = 0 and k = N/2 terms (in this normalization, C_0 = Z_0 and C_{N/2} = Z_{N/2}) for both Algorithm 3 and Algorithm 6. (This leads to a commonly quoted formula, 2N \log_2 N - N, for the previous flop count.)

It is also common to compute a DCT-II with scaled outputs, e.g. for the JPEG image-compression standard, where an arbitrary scaling can be absorbed into a subsequent quantization step, and in this case the scaling can save 8 multiplications over the 42 flops required for a unitary 8-point DCT-II. Since our scaled subtransform attempts to be the optimally scaled FFT, we should be able to derive this scaled DCT-II by using t_{4N,k} for our length-4N DFT instead of \omega_{4N}^k s_{N,k}, and again pruning the redundant operations. Doing so, we obtain an algorithm identical to Algorithm 6 except that \omega_{4N}^k s_{N,k} is replaced by t_{4N,k}, which requires two fewer multiplications, and also we now obtain the k = 0 and k = N/2 outputs without any multiplications. This algorithm computes C_k / s_{4N,k} instead of C_k. In particular, we save exactly N multiplications over Algorithm 6, which matches the result by Arai et al. for N = 8 but generalizes it to all N. This savings of N multiplications for a scaled DCT-II also matches the flop count that was achieved by an automatic code generator in our previous work.

## VI Fast DCT-III, DST-II, and DST-III

Given any algorithm for the DCT-II, one can immediately obtain algorithms for the DCT-III, DST-II, and DST-III with exactly the same number of arithmetic operations. In this way, any improvement in the DCT-II, such as the one described in this paper, immediately leads to improved algorithms for those transforms and vice versa. In this section, we review the equivalence between those transforms.

### VI-A DCT-III

To obtain a DCT-III algorithm from a DCT-II algorithm, one can exploit two facts. First, the DCT-III, viewed as a matrix, is simply the transpose of the DCT-II matrix. Second, any linear algorithm can be viewed as a linear network, and the transpose operation is computed by the network transposition of this algorithm. To review, a linear network represents the algorithm by a directed acyclic graph, where the edges represent multiplications by constants and the vertices represent additions of the incoming edges. Network transposition simply consists of reversing the direction of every edge. We prove below that the transposed network requires the same number of additions and multiplications for networks with the same number of inputs and outputs, and therefore the DCT-III can be computed in the same number of flops as the DCT-II by the transposed algorithm. The DCT-III corresponding to the transpose of eq. (4) is

 C^{T}_k = 2 \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}\, n\left(k+\frac{1}{2}\right)\right]. (15)

Since the DCT-III is the transpose of the DCT-II, a fast DCT-III algorithm is obtained by network transposition of a fast DCT-II algorithm. For example, network transposition of a size-2 DCT-II is shown in figure 3.

Figure 3: Example of network transposition for a size-2 DCT-II (left). When the linear network is transposed (all edges are reversed), the resulting network computes the transposed-matrix operation: a size-2 DCT-III (right).
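The matrix-transpose relationship between eqs. (4) and (15) is easy to confirm directly (a minimal numerical check, with variable names ours):

```python
import numpy as np

N = 8
n = np.arange(N)
k = np.arange(N)

# DCT-II matrix of eq. (4):  D2[k, n] = 2 cos[(pi/N) (n + 1/2) k]
D2 = 2 * np.cos(np.pi / N * np.outer(k, n + 0.5))

# DCT-III matrix of eq. (15): D3[k, n] = 2 cos[(pi/N) n (k + 1/2)]
D3 = 2 * np.cos(np.pi / N * np.outer(k + 0.5, n))
```

Since `D3` equals `D2.T` exactly, any fast linear network computing `D2 @ x` yields, by edge reversal, a network of identical operation count computing `D3 @ x`.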

A proof that the transposed network requires the same number of flops is as follows. Clearly, the number of multiplications, i.e. the number of edges with weight other than \pm 1, is unchanged by transposition. The number of additions is given by the sum of (indegree − 1) over all the vertices, except for the input vertices, which have indegree zero. That is, for a set V of vertices and a set E of edges: