## I Introduction

In this paper, we describe recursive algorithms for the type-II and type-III discrete cosine and sine transforms (DCT-II and DCT-III, and also DST-II and DST-III), of power-of-two sizes, that require fewer total real additions and multiplications than previously published algorithms (with an asymptotic reduction of about 5.6%), without sacrificing numerical accuracy. Our DCT and DST algorithms are based on a recently published fast Fourier transform (FFT) algorithm, which reduced the operation count for the discrete Fourier transform (DFT) compared to the previous-best split-radix algorithm [1]. The gains in this new FFT algorithm, and consequently in the new DCTs and DSTs, stem from a recursive rescaling of the internal multiplicative factors within an algorithm called a “conjugate-pair” split-radix FFT [2, 3, 4, 5] so as to simplify some of the multiplications. In order to derive a DCT algorithm from this FFT, we simply consider the DCT-II to be a special case of a DFT with real input of a certain even symmetry, and discard the redundant operations [6, 7, 8, 9, 10, 1]. The DCT-III, DST-II, and DST-III have identical operation counts to the DCT-II of the same size, since the algorithms are related by simple transpositions, permutations, and sign flips [11, 12, 13].

Since 1968, the lowest total count of real additions and multiplications, herein called “flops” (floating-point operations), for the DFT of a power-of-two size $N$ was achieved by the split-radix algorithm, with $4N\lg N - 6N + 8$ flops for $N > 1$ [14, 15, 16, 6, 8]. This count was recently surpassed by new algorithms achieving a flop count of $\frac{34}{9}N\lg N + O(N)$ [1, 17]. Similarly, the lowest-known flop count for the DCT-II of size $N = 2^m$ was previously $2N\lg N - N + 2$ for a unitary normalization (with the additive constant depending on the normalization) [18, 6, 19, 7, 20, 21, 12, 22, 13, 23, 24, 25, 26], and could be achieved by starting with the split-radix FFT and discarding redundant operations [6, 7]. (Many DCT algorithms with an unreported or larger flop count have also been described [27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40].) Based on our new FFT algorithm, the flop counts for the various DCT types were reduced using a code generator [41, 10] that automatically pruned the redundant operations from an FFT with a given symmetry, but neither an explicit algorithm nor a general formula for the flop count was presented except for the DCT-I [1]. In this paper, we use the same starting point to “manually” derive a DCT-II algorithm by pruning redundant operations from a real-even FFT, and give the general formula for the new flop count (for $N = 2^m > 1$):

$$\frac{17}{9} N \lg N - \frac{17}{27} N - \frac{1}{9}(-1)^{\lg N}\lg N + \frac{7}{54}(-1)^{\lg N} + \frac{3}{2}, \qquad (1)$$

where $\lg$ denotes $\log_2$.

The first savings over the previous record occur for $N = 16$, and are summarized in Table I for several $N$.

Table I: flop counts for the DCT-II of size $N$.

| $N$ | previous best DCT-II | New algorithm |
|---|---|---|
| 16 | 114 | 112 |
| 32 | 290 | 284 |
| 64 | 706 | 686 |
| 128 | 1666 | 1614 |
| 256 | 3842 | 3708 |
| 512 | 8706 | 8384 |
| 1024 | 19458 | 18698 |
| 2048 | 43010 | 41266 |
| 4096 | 94210 | 90264 |
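
As a quick consistency check, the closed forms for both counts can be compared against Table I. The formula for the new count below is our reconstruction of eq. (1) from the tabulated values, and the function names are ours:

```python
def lg(N):
    # exact log2 for a power of two
    return N.bit_length() - 1

def prev_best(N):
    # previous record for the DCT-II: 2 N lg N - N + 2
    return 2*N*lg(N) - N + 2

def new_count(N):
    # reconstruction of eq. (1)
    L, s = lg(N), (-1)**lg(N)
    return round(17/9*N*L - 17/27*N - s*L/9 + 7*s/54 + 3/2)

table = {16: (114, 112), 32: (290, 284), 64: (706, 686),
         128: (1666, 1614), 256: (3842, 3708), 512: (8706, 8384),
         1024: (19458, 18698), 2048: (43010, 41266), 4096: (94210, 90264)}
```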

We also consider the effect of normalization on this flop count:
the above count was for a unitary transform, but slightly different
counts are obtained for other choices. Moreover, we show that a further
$N$ multiplications can be saved by individually rescaling *every*
output of the DCT-II (or input of the DCT-III). In doing so, we generalize
a result by Arai *et al.*, who showed that eight multiplications
could be saved by scaling the outputs of a DCT-II of size 8 [42],
a result commonly applied to JPEG compression [43].

If we merely wished to show that the DCT-II/III could be computed
in $\frac{17}{9}N\lg N + O(N)$ operations, we could do so by using
well-known techniques to re-express the DCT in terms of a real-input
DFT of length $N$ plus $O(N)$ pre/post-processing operations [29, 30, 33, 44, 45, 46],
and then substituting our new FFT that requires $\frac{17}{9}N\lg N + O(N)$
operations for real inputs [1]. However, with FFT and
DCT algorithms, there is great interest in obtaining not only the
best possible asymptotic constant factor, but also the best possible
*exact* count of arithmetic operations (which, for the DCT-II
of power-of-two sizes, had not changed by even one operation for over
20 years). Our result (1) is intended as a new
upper bound on this (still unknown) minimum exact count, and therefore
we have done our best with the $O(N)$ terms as well as the asymptotic
constant. It turns out, in fact, that our algorithm to achieve (1)
is closely related to well-known algorithms for expressing the DCT-II
in terms of a real-input FFT of the same length, but it does not seem
obvious *a priori* that this is what one obtains by pruning our
FFT for symmetric data.

In the following sections, we first review how a DCT-II may be expressed
as a special case of a DFT, and how the previous optimum flop count
can be achieved by pruning redundant operations from a conjugate-pair
split-radix FFT. Then, we briefly review the new FFT algorithm presented
in [1], and derive the new DCT-II algorithm. We follow
by considering the effect of normalization and scaling. Finally, we
consider the extension of this algorithm to algorithms for the DCT-III,
DST-II, and DST-III. We close with some concluding remarks about future
directions. We emphasize that no *proven* tight lower bound on
the DCT-II flop count is currently known, and we make no claim that
eq. (1) is the lowest possible (although
we have endeavored not to miss any obvious optimizations).

## II DCT-II in terms of DFT

Various forms of discrete cosine transform have been defined, corresponding to different boundary conditions on the transform [47]. Perhaps the most common form is the type-II DCT, used in image compression [43]

and many other applications. The DCT-II is typically defined as a real, orthogonal (unitary), linear transformation by the formula (for $k = 0, \ldots, N-1$):

$$C^{II}_k = \sqrt{\frac{2 - \delta_{k0}}{N}} \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right] \qquad (2)$$

for inputs $x_n$ and outputs $C^{II}_k$, where $\delta_{k0}$ is the Kronecker delta ($\delta_{k0} = 1$ for $k = 0$ and $0$ otherwise). However, we wish to emphasize in this paper that the DCT-II (and, indeed, all types of DCT) can be viewed as special cases of the discrete Fourier transform (DFT) with real inputs of a certain symmetry, and where only a subset of the outputs need be computed. This (well known) viewpoint is fruitful because it means that any FFT algorithm for the DFT leads immediately to a corresponding fast algorithm for the DCT-II simply by discarding the redundant operations [6, 7, 8, 9, 10, 1].

The discrete Fourier transform of size $N$ is defined by

$$y_k = \sum_{n=0}^{N-1} \omega_N^{nk} x_n, \qquad (3)$$

where $\omega_N = e^{-2\pi i/N}$ is a primitive $N$th root of unity. In order to relate this to the DCT-II, it is convenient to choose a different normalization for the latter transform:

$$C_k = 2 \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right]. \qquad (4)$$

This normalization is not unitary, but it is more directly related to the DFT and therefore more convenient for the development of algorithms. Of course, any fast algorithm for $C_k$ trivially yields a fast algorithm for $C^{II}_k$, although the exact count of required multiplications depends on the normalization, an issue we discuss in more detail in section V.

In order to derive $C_k$ from the DFT formula, one can use the identity $2\cos\theta = e^{i\theta} + e^{-i\theta}$ to write:

$$C_k = \tilde{y}_k = \sum_{n=0}^{4N-1} \omega_{4N}^{nk}\, \tilde{x}_n \quad (k = 0, \ldots, N-1), \qquad (5)$$

where $\tilde{x}_n$ is a real-even sequence of length $4N$ (i.e. $\tilde{x}_n = \tilde{x}_{4N-n}$), defined as follows for $0 \le n < N$:

$$\tilde{x}_{2n} = 0, \qquad (6)$$

$$\tilde{x}_{2n+1} = \tilde{x}_{4N-2n-1} = x_n. \qquad (7)$$

Thus, the DCT-II of size $N$ is precisely the first $N$ outputs of a DFT of size $4N$, of real-even inputs, where the even-indexed inputs are zero.
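
The zero-interleaved embedding of eqs. (6)-(7) is easy to verify numerically. The following sketch (helper names are ours; a direct $O(N^2)$ DFT is used for checking only) confirms that the first $N$ outputs of the size-$4N$ DFT reproduce the nonunitary DCT-II of eq. (4):

```python
import cmath
import math

def dct2(x):
    # nonunitary DCT-II, eq. (4)
    N = len(x)
    return [2*sum(x[n]*math.cos(math.pi*(n + 0.5)*k/N) for n in range(N))
            for k in range(N)]

def dft(x):
    # direct DFT, eq. (3); O(N^2), for verification only
    N = len(x)
    return [sum(x[n]*cmath.exp(-2j*math.pi*n*k/N) for n in range(N))
            for k in range(N)]

N = 8
x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 0.0, 1.5]
xt = [0.0]*(4*N)              # eqs. (6)-(7): zero-interleaved, real-even
for n in range(N):
    xt[2*n + 1] = x[n]
    xt[4*N - 2*n - 1] = x[n]
y = dft(xt)                   # size-4N DFT of the extended sequence
C = dct2(x)                   # first N outputs of y should equal C
```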

This is illustrated by an example, for $N = 8$, in figure 1. The eight inputs of the DCT are shown as open dots, which are interleaved with zeros (black dots) and extended to an even (squares) periodic (gray dots) sequence of length $4N = 32$ corresponding to the DFT. (The type-II DCT is distinguished from the other DCT types by the fact that it is even about both the left and right boundaries of the original data, and the symmetry points fall halfway in between pairs of the original data points.) We will refer, below, to this figure in order to illustrate what happens when an FFT algorithm is applied to this real-symmetric zero-interleaved data.

## III Conjugate-pair FFT and DCT-II

Although the previous minimum flop count for DCT-II algorithms can be derived from the ordinary split-radix FFT algorithm [6, 7] (and it can also be derived in other ways [18, 19, 20, 21, 12, 22, 13, 23, 24, 25, 26]), here we will do the same thing using a variant dubbed the “conjugate-pair” split-radix FFT. This algorithm was originally proposed to reduce the number of flops [2], but was later shown to have the same flop count as ordinary split-radix [3, 4, 5]. It turns out, however, that the conjugate-pair algorithm exposes symmetries in the multiplicative factors that can be exploited to reduce the flop count by an appropriate rescaling [1], which we will employ in the following sections.

### III-A Conjugate-pair split-radix FFT

Starting with the DFT of equation (3), the decimation-in-time conjugate-pair FFT splits it into three smaller DFTs: one of size $N/2$ of the even-indexed inputs $x_{2n}$, and two of size $N/4$:

$$y_k = \sum_{n=0}^{N/2-1} \omega_{N/2}^{nk}\, x_{2n} + \omega_N^{k} \sum_{n=0}^{N/4-1} \omega_{N/4}^{nk}\, x_{4n+1} + \omega_N^{-k} \sum_{n=0}^{N/4-1} \omega_{N/4}^{nk}\, x_{4n-1}, \qquad (8)$$

where the indices $4n - 1$ are computed modulo $N$. [In contrast, the ordinary split-radix FFT uses $x_{4n+3}$ for the third sum (a cyclic shift of $x_{4n-1}$), with a corresponding multiplicative “twiddle” factor of $\omega_N^{3k}$.] This decomposition is repeated recursively (until base cases of size $N = 1$ or $N = 2$, not shown, are reached), as shown by the pseudo-code in Algorithm 1.
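
One level of the conjugate-pair decomposition, eq. (8), can be checked directly against the full DFT; this sketch (helper names are ours) applies the split once, without recursion:

```python
import cmath
import math

def dft(x):
    # direct DFT, O(N^2), for verification only
    N = len(x)
    return [sum(x[n]*cmath.exp(-2j*math.pi*n*k/N) for n in range(N))
            for k in range(N)]

N = 16
x = [complex(n % 5, (3*n) % 7) for n in range(N)]
u = dft(x[0::2])                                     # size N/2: x_{2n}
z = dft([x[(4*n + 1) % N] for n in range(N // 4)])   # size N/4: x_{4n+1}
zp = dft([x[(4*n - 1) % N] for n in range(N // 4)])  # size N/4: x_{4n-1}

def w(k):
    return cmath.exp(-2j*math.pi*k/N)

# eq. (8): recombine the three subtransforms (indices taken periodically)
y = [u[k % (N // 2)] + w(k)*z[k % (N // 4)] + w(-k)*zp[k % (N // 4)]
     for k in range(N)]
```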

Here, we denote the results of the three sub-transforms of size $N/2$, $N/4$, and $N/4$ by $u_k$, $z_k$, and $z'_k$, respectively. The number of flops required by this algorithm, after certain simplifications (common subexpression elimination and constant folding, and special-case simplifications of the constants for $k = 0$ and $k = N/8$) and not counting data-independent operations like the computation of $\omega_N^k$, is $4N\lg N - 6N + 8$, identical to ordinary split radix.

In the following sections, we will have to exploit further simplifications for the case where the inputs are real. In this case, $y_{N-k} = \bar{y}_k$ (where the bar denotes complex conjugation) and one can save slightly more than half of the flops, both in the ordinary split-radix [7, 48] and in the conjugate-pair split-radix [1], by eliminating redundant operations, to achieve a flop count of $2N\lg N - 4N + 6$. Specifically, one need only compute the outputs $y_k$ for $0 \le k \le N/2$ (where the $k = 0$ and $k = N/2$ outputs are purely real), and the corresponding algorithm is shown in Algorithm 2.

Again, for simplicity we do not show special-case optimizations for $k = 0$ and $k = N/8$, so it may appear that the $y_0$, $y_{N/4}$, and $y_{N/2}$ outputs are computed twice. To derive Algorithm 2, one exploits two facts. First, the sub-transforms operate on real data and therefore have conjugate-symmetric output. Second, the twiddle factors in Algorithm 1 have some redundancy because of the identity $\omega_N^{N/4-k} = -i\,\overline{\omega_N^{k}}$, which allows us to share twiddle factors between $k$ and $N/4 - k$ in the original loop.

### III-B Fast DCT-II via split-radix FFT

Before we derive our fast DCT-II with a reduced flop count, we first
derive a DCT-II algorithm with the *same* flop count as previously
published algorithms, but starting with the conjugate-pair split-radix
algorithm. This algorithm will then be modified in section IV-B,
below, to reduce the number of multiplications.

In eq. (5), we showed that a DCT-II of
size $N$, with inputs $x_n$ and outputs $C_k$, can be expressed
as the first $N$ outputs of a DFT of size $4N$ of real-even
inputs $\tilde{x}_n$. [Therefore, $N$ in eq. (8)
above and in the surrounding text is replaced below by $4N$.]
Now, consider what happens when we evaluate this DFT via the conjugate-pair
split decomposition in eq. (8) and Algorithm 2.
In this algorithm, we compute three smaller DFTs. First, $u_k$
is the DFT of size $2N$ of the even-indexed inputs $\tilde{x}_{2n}$,
but these are all zero and so $u_k$ is *zero*. Second, $z_k$
is the DFT of size $N$ of the inputs $\tilde{x}_{4n+1}$
(where $4n + 1$ is evaluated modulo $4N$), which by eq. (7)
correspond to the original data by the formula:

$$e_n = \tilde{x}_{4n+1} = \begin{cases} x_{2n} & 0 \le n < N/2 \\ x_{2(N-n)-1} & N/2 \le n < N, \end{cases} \qquad (9)$$

where we have denoted them by $e_n$ ($n = 0, \ldots, N-1$) for
convenience. That is, $z_k$ is the *real-input* DFT of the
even-indexed elements of $x_n$ followed by the odd-indexed elements
of $x_n$ in reverse order. For example, this is shown for $N = 8$ in figure 2 with the circled points corresponding to the $e_n$, which are clearly the even-indexed $x_n$ followed by the odd-indexed $x_n$ in reverse. The second size-$N$ DFT, $z'_k$, is redundant: it is the DFT of $\tilde{x}_{4n-1}$, but by the even symmetry of $\tilde{x}$ this is equal to $\tilde{x}_{4(N-n)+1} = e_{N-n}$, and therefore $z'_k = \bar{z}_k$ (the complex conjugate of $z_k$).

Therefore, a fast DCT-II is obtained merely by computing a *single*
real-input DFT of size $N$ to obtain $z_k$, and then
combining this according to Algorithm 2
to obtain $C_k$ for $k = 0, \ldots, N-1$. In particular, in the
loop of Algorithm 2, all but the $\omega_{4N}^{k} z_k$
and $\omega_{4N}^{-k} \bar{z}_k$ terms correspond to subscripts $k \ge N$
and are not needed. Moreover, since $u_k = 0$, we obtain

$$C_k = \omega_{4N}^{k} z_k + \omega_{4N}^{-k} \bar{z}_k = 2\,\mathrm{Re}\!\left[\omega_{4N}^{k} z_k\right].$$

The resulting algorithm, including the special-case optimizations for $k = 0$ (where $\omega_{4N}^{0} = 1$ and $z_0$ is real) and $k = N/2$ (where $\omega_{4N}^{N/2} = (1 - i)/\sqrt{2}$ and $z_{N/2}$ is real), is shown in Algorithm 3.

In fact, Algorithm 3 is equivalent to an algorithm derived in a different way by various authors [30, 33] to express a DCT-II in terms of a real-input DFT of the same length. Here, we see that this algorithm is exactly equivalent to a conjugate-pair split-radix FFT of length $4N$. Previous authors obtained a suboptimal flop count for the DCT-II using this algorithm [33] only because they used a suboptimal real-input DFT for the subtransform (split-radix being then almost unknown). Using a real-input split-radix DFT to compute $z_k$, the flop count for $z$ is $2N\lg N - 4N + 6$ from above. To get the total flop count for Algorithm 3, we need to add $N/2 - 1$ general complex multiplications (6 flops each) plus two real multiplications (2 flops), for a total of $2N\lg N - N + 2$ flops. This matches the best-known flop count in the literature (where the $+2$ can be removed merely by choosing a different normalization, as discussed in section V).
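
Algorithm 3 thus reduces to a single size-$N$ real-input DFT of the permuted data $e_n$ plus the $2\,\mathrm{Re}[\omega_{4N}^k z_k]$ post-processing; this can be sketched as follows (a naive $O(N^2)$ DFT stands in for the split-radix subtransform, and the names are ours):

```python
import cmath
import math

def dct2(x):
    # nonunitary DCT-II, eq. (4), by direct summation
    N = len(x)
    return [2*sum(x[n]*math.cos(math.pi*(n + 0.5)*k/N) for n in range(N))
            for k in range(N)]

def dft(x):
    # stand-in for the real-input split-radix subtransform
    N = len(x)
    return [sum(x[n]*cmath.exp(-2j*math.pi*n*k/N) for n in range(N))
            for k in range(N)]

N = 8
x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 0.0, 1.5]
e = x[0::2] + x[1::2][::-1]   # eq. (9): evens, then odds reversed
z = dft(e)                    # the single size-N real-input DFT
C = [2*(cmath.exp(-2j*math.pi*k/(4*N))*z[k]).real for k in range(N)]
```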

## IV New FFT and DCT-II

In this section, we first review the new FFT algorithm introduced in our previous work based on a recursive rescaling of the conjugate-pair FFT [1], and then apply it to derive a fast DCT-II algorithm as in the previous section.

### IV-A New FFT

Based on the conjugate-pair split-radix FFT from section III, a new FFT algorithm with a reduced number of flops can be derived by scaling the subtransforms [1]. We will not reproduce the derivation here, but will simply summarize the results. In particular, the original conjugate-pair split-radix Algorithm 1 is split into four mutually recursive subroutines, each of which has the same split-radix structure but computes a DFT scaled by a different factor. These algorithms are shown in Algorithm 4 for the case of complex inputs, and in Algorithm 5 specialized for real inputs, in both of which the scaling factors are combined with the twiddle factors to reduce the total number of multiplications compared to Algorithms 1 and 2, respectively.

Again, special cases for $k = 0$ and $k = N/8$, as well as the base cases and obvious simplifications such as common-subexpression elimination, are not shown for simplicity.

The key aspect of these algorithms is the scale factor $s_{N,k}$, where the subtransforms compute the DFT scaled by $1/s_{\ell N,k}$ for $\ell = 1, 2, 4$. (The scaled subroutines are condensed into one in Algorithms 4–5, by making $\ell$ a variable, but in practice they need to be implemented separately in order to exploit the special cases of the scale factor [1]. The unscaled case is written separately because it is factorized differently.) This scale factor is defined for $N > 4$ by the following recurrence, where $k_4 = k$ mod $N/4$:

$$s_{N,k} = \begin{cases} 1 & N \le 4 \\ s_{N/4,k_4} \cos(2\pi k_4/N) & k_4 \le N/8 \\ s_{N/4,k_4} \sin(2\pi k_4/N) & \text{otherwise.} \end{cases} \qquad (10)$$

This definition has the properties: $s_{N,0} = 1$, $s_{N,k+N/4} = s_{N,k}$, and $s_{N,N/4-k} = s_{N,k}$. Also, $0 < s_{N,k} \le 1$, and $s_{N,k}$ decays rather slowly with $N$: it is bounded below asymptotically by $N^{-1/4}$ [1]. When these scale factors are combined with the twiddle factors $\omega_N^k$, we obtain terms of the form

$$t_{N,k} = \omega_N^{k}\, s_{N/4,k} \,/\, s_{N,k}, \qquad (11)$$

which is always a complex number of the form
$1 - i\tan(2\pi k_4/N)$ or $\cot(2\pi k_4/N) - i$ (up to multiplication
by $\pm 1$ or $\pm i$), and can therefore be multiplied
with two fewer real multiplications than are required to multiply
by a general $\omega_N^k$. Because of the symmetry $s_{N,N-k} = s_{N,k}$,
it is possible to specialize Algorithm 4 for real
data in the same way as in Algorithm 2,
because the scale factor preserves the conjugate symmetry
of the outputs of all subtransforms [1].^1

^1 In Algorithm 5, we have also utilized various
symmetries of the scale factors, such as $s_{N,N/4-k} = s_{N,k}$,
as described in our previous work [1], to make explicit
the shared multiplicative factors between the different terms.
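
As a numerical sanity check, here is a short sketch of the scale-factor recurrence as reconstructed in eq. (10); the case boundaries and base case are our reading of [1]:

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def s(N, k):
    # scale factor of eq. (10), as reconstructed here
    if N <= 4:
        return 1.0
    k4 = k % (N // 4)
    if k4 <= N // 8:
        return s(N // 4, k4) * math.cos(2*math.pi*k4/N)
    return s(N // 4, k4) * math.sin(2*math.pi*k4/N)
```

The properties claimed above (value 1 at $k = 0$, periodicity $N/4$, mirror symmetry, and the slow $N^{-1/4}$ decay) all hold for this reconstruction.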

The resulting flop count, for arbitrary complex data $x_n$, is then reduced from $4N\lg N - 6N + 8$ to

$$\frac{34}{9} N \lg N - \frac{124}{27} N - 2\lg N - \frac{2}{9}(-1)^{\lg N}\lg N + \frac{16}{27}(-1)^{\lg N} + 8. \qquad (12)$$

In particular, assuming that complex multiplications are implemented as four real multiplications and two real additions, the savings are purely in the number of real multiplications. The number of real multiplications saved over ordinary split radix is given by:

$$M(N) = \frac{2}{9} N \lg N - \frac{38}{27} N + 2\lg N + \frac{2}{9}(-1)^{\lg N}\lg N - \frac{16}{27}(-1)^{\lg N}. \qquad (13)$$
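
Under the formulas as reconstructed here, the multiplication savings $M(N)$ of eq. (13) is exactly the gap between the split-radix count and eq. (12), first becoming nonzero at $N = 64$:

```python
def lg(N):
    return N.bit_length() - 1

def split_radix(N):
    # classic split-radix count: 4 N lg N - 6 N + 8
    return 4*N*lg(N) - 6*N + 8

def new_fft(N):
    # eq. (12), as reconstructed here
    L, s = lg(N), (-1)**lg(N)
    return round(34/9*N*L - 124/27*N - 2*L - 2/9*s*L + 16/27*s + 8)

def M(N):
    # eq. (13); should equal split_radix(N) - new_fft(N) exactly
    L, s = lg(N), (-1)**lg(N)
    return round(2/9*N*L - 38/27*N + 2*L + 2/9*s*L - 16/27*s)
```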

If the data are purely real, it was shown that $M(N)/2$ multiplications are saved over the corresponding real-input split-radix algorithm [1]. These flop counts are to compute the original, unscaled definition of the DFT. If one is allowed to scale the outputs by any factor desired, then scaling by $1/s_{N,k}$ [corresponding to the $\ell = 1$ case in Algorithm 4] saves an additional $M_S(N)$ multiplications for complex inputs, where:

$$M_S(N) = \frac{2}{3} N - 2\lg N + \frac{1}{3}(-1)^{\lg N} + 1. \qquad (14)$$

For real inputs, one similarly saves $M_S(N)/2$ flops for operating on real data in Algorithm 5.

Although the division by a cosine or sine in the scale factor may, at first glance, seem as if it would exacerbate floating-point errors, this is not the case. Cooley-Tukey based FFT algorithms have $O(\sqrt{\log N})$ root-mean-square error growth, and $O(\log N)$ worst-case error bounds, in finite-precision floating-point arithmetic (assuming that the trigonometric constants are precomputed accurately) [49, 50, 51], and the new FFT is no different [1]. The reason for this is straightforward: it never adds or subtracts scaled and unscaled values. Instead, wherever the original FFT would compute $a \pm b$, the new FFT computes $a/s \pm b/s$ for some fixed scale factor $s$.

### IV-B Fast DCT-II from new FFT

To derive the new DCT-II algorithm based on the new FFT of the previous section, we follow exactly the same process as in Sec. III-B. We express the DCT-II of length $N$ in terms of a DFT of length $4N$, apply the FFT of Algorithm 5, and discard the redundant operations. As before, the $u_k$ subtransform is identically zero, $z_k$ is the transform of the even-indexed elements of $x_n$ followed by the odd-indexed elements in reverse order, and $z'_k = \bar{z}_k$ is redundant.

The resulting algorithm is shown in Algorithm 6, and differs from the original fast DCT-II of Algorithm 3 in only two ways. First, instead of calling the ordinary split-radix (or conjugate-pair) FFT for the $z_k$ subtransform, it calls the scaled real-input transform of Algorithm 5. Second, because this subtransform is scaled by $1/s_{N,k}$, the twiddle factor $\omega_{4N}^{k}$ in Algorithm 6 is multiplied by $s_{N,k}$.

To derive the flop count for Algorithm 6, we merely need to add the flop count for the $z_k$ subtransform [which saves $[M(N) + M_S(N)]/2$ flops, from eqs. (13) and (14), compared to ordinary real-input split radix] with the number of flops in the loop, where the latter is exactly the same as for Algorithm 3 because the products $\omega_{4N}^{k} s_{N,k}$ can be precomputed. Therefore, we save a total of $[M(N) + M_S(N)]/2$ flops compared to the previous best flop count of $2N\lg N - N + 2$, resulting in the flop count of eq. (1). This formula matches the flop count that was achieved by an automatic code generator in Ref. [1].
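
Putting the pieces together: subtracting the scaled real-subtransform savings $[M(N) + M_S(N)]/2$ from the previous best count reproduces the new counts in Table I (all formulas as reconstructed in this paper):

```python
def lg(N):
    return N.bit_length() - 1

def prev_best(N):
    # previous record for the DCT-II
    return 2*N*lg(N) - N + 2

def M(N):
    # eq. (13), as reconstructed here
    L, s = lg(N), (-1)**lg(N)
    return round(2/9*N*L - 38/27*N + 2*L + 2/9*s*L - 16/27*s)

def M_S(N):
    # eq. (14), as reconstructed here
    L, s = lg(N), (-1)**lg(N)
    return round(2/3*N - 2*L + s/3 + 1)

def new_dct2(N):
    # previous best minus the savings of the scaled real subtransform
    return prev_best(N) - (M(N) + M_S(N)) // 2
```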

Because the new DCT algorithm is mathematically merely a special set of inputs for the underlying FFT, it will have the same favorable logarithmic error bounds as discussed in the previous section.

## V Normalizations

In the above definition (4) of the DCT-II, we use a scale factor “2” in front of the summation in order to make the DCT-II directly equivalent to the corresponding DFT. But in some circumstances it is useful to multiply by other factors, and different normalizations lead to slightly different flop counts. For example, one could use the unitary normalization from eq. (2), which replaces the 2 by $\sqrt{2/N}$ or $\sqrt{1/N}$ (for $k = 0$) and requires the same number of flops. If one uses the unitary normalization multiplied by $\sqrt{N}$, then one saves two multiplications in the $k = 0$ and $k = N/2$ terms (in this normalization, those two outputs are simply $z_0$ and $z_{N/2}$) for both Algorithm 3 and Algorithm 6. (This leads to a commonly quoted formula, $2N\lg N - N$, for the previous flop count.)

It is also common to compute a DCT-II with scaled outputs, *e.g.*
for the JPEG image-compression standard where an arbitrary scaling
can be absorbed into a subsequent quantization step [43],
and in this case the scaling can save 8 multiplications [42]
over the 42 flops required for a unitary 8-point DCT-II. Since our
scaled subroutine attempts to be the optimally scaled
FFT, we should be able to derive this scaled DCT-II by using
the scaled transform for our length-$4N$ DFT instead of the unscaled one, and
again pruning the redundant operations. Doing so, we obtain an algorithm
identical to Algorithm 6 except that $\omega_{4N}^{k} s_{N,k}$
is replaced by $t_{4N,k}$, which requires two fewer multiplications,
and also we now obtain $\bar{C}_0 = z_0$ and $\bar{C}_{N/2} = z_{N/2}$. This
algorithm computes the scaled outputs $\bar{C}_k = C_k / (2 s_{4N,k})$ instead of $C_k$. In particular,
we save exactly $N$ multiplications over Algorithm 6,
which matches the result of Ref. [42] for $N = 8$ but
generalizes it to all $N$. This savings of $N$ multiplications for
a scaled DCT-II also matches the flop count that was achieved by an
automatic code generator in Ref. [1].

## VI Fast DCT-III, DST-II, and DST-III

Given any algorithm for the DCT-II, one can immediately obtain algorithms for the DCT-III, DST-II, and DST-III with exactly the same number of arithmetic operations. In this way, any improvement in the DCT-II, such as the one described in this paper, immediately leads to improved algorithms for those transforms and vice versa. In this section, we review the equivalence between those transforms.

### VI-A DCT-III

To obtain a DCT-III algorithm from a DCT-II algorithm, one can exploit
two facts. First, the DCT-III, viewed as a matrix, is simply the transpose
of the DCT-II matrix. Second, any linear algorithm can be viewed as
a *linear network*, and the transpose operation is computed by
the *network transposition* of this algorithm [52].
To review, a linear network represents the algorithm by a directed
acyclic graph, where the edges represent multiplications by constants
and the vertices represent additions of the incoming edges. Network
transposition simply consists of reversing the direction of every
edge. We prove below that the transposed network requires the same
number of additions and multiplications for networks with the same
number of inputs and outputs, and therefore the DCT-III can be computed
in the same number of flops as the DCT-II by the transposed algorithm.
The DCT-III corresponding to the transpose of eq. (4)
is

$$C^{III}_k = 2 \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}\, n \left(k + \frac{1}{2}\right)\right]. \qquad (15)$$

Since the DCT-III is the transpose of the DCT-II, a fast DCT-III algorithm is obtained by network transposition of a fast DCT-II algorithm. For example, network transposition of a size-2 DCT-III is shown in figure 3.
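
The transpose relation between eqs. (4) and (15) can be checked by building both matrices explicitly:

```python
import math

N = 4
# DCT-II matrix of eq. (4):   M2[k][n] = 2 cos[(pi/N) (n + 1/2) k]
M2 = [[2*math.cos(math.pi*(n + 0.5)*k/N) for n in range(N)]
      for k in range(N)]
# DCT-III matrix of eq. (15): M3[k][n] = 2 cos[(pi/N) n (k + 1/2)]
M3 = [[2*math.cos(math.pi*n*(k + 0.5)/N) for n in range(N)]
      for k in range(N)]
```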

A proof that the transposed network requires the same number of flops is as follows. Clearly, the number of multiplications, i.e. the number of edges with weight $\ne 1$, is unchanged by transposition. The number of additions is given by the sum of (indegree $-$ 1) over all the vertices, except for the input vertices, which have indegree zero. That is, for a set $V$ of vertices and a set $E$ of edges:

$$\#\text{adds} = \sum_{v \in V \setminus \{\text{inputs}\}} \left[\operatorname{indegree}(v) - 1\right] = |E| - |V| + \#\text{inputs}, \qquad (16)$$

using the fact that the sum of the indegrees over all vertices is simply the number of edges $|E|$. Because the above expression is obviously invariant under network transposition as long as $\#\text{inputs}$ is unchanged (i.e. same number of inputs and outputs), the number of additions is invariant.
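
The counting argument of eq. (16) can be illustrated on a toy linear network (a size-2 butterfly); the edge-list encoding below is our own:

```python
# a size-2 butterfly: y0 = x0 + x1, y1 = x0 - x1
# edges (u, v, w): the value at u, times weight w, accumulates into v
edges = [("x0", "y0", 1.0), ("x1", "y0", 1.0),
         ("x0", "y1", 1.0), ("x1", "y1", -1.0)]
inputs, outputs = {"x0", "x1"}, {"y0", "y1"}

def n_adds(edges, inputs):
    # eq. (16): sum of (indegree - 1) over non-input vertices
    indeg = {}
    for u, v, w in edges:
        indeg[v] = indeg.get(v, 0) + 1
    return sum(d - 1 for v, d in indeg.items() if v not in inputs)

def transpose(edges):
    # network transposition: reverse every edge (weights unchanged)
    return [(v, u, w) for (u, v, w) in edges]

fwd = n_adds(edges, inputs)
rev = n_adds(transpose(edges), outputs)  # old outputs are new inputs
```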

More explicitly, a fast DCT-III algorithm derived from the transpose
of our new Algorithm 6 is outlined in Algorithm 7.
Whereas for the DCT-II we computed the real-input DFT (with conjugate-symmetric
output) of the rearranged inputs $e_n$ and then post-processed
the complex outputs $z_k$, now we do the reverse. We
first preprocess the inputs to form a complex array $z_k$,
then perform a real-*output*, scaled-*input* DFT to obtain
$e_n$, and finally assign the even-indexed and odd-indexed elements of
the result from the first and second (reversed) halves
of $e_n$. Without the scale factors, this is equivalent
to a well-known algorithm to express a DCT-III in terms of a real-output
DFT of the same size [30, 33]. In order to minimize
the flop count once the scale factors are included, however, it is
important that this real-output DFT operate on scaled *inputs*
rather than scaled *outputs*, so it is intrinsically different
(transposed) from Algorithm 4. The real-output
(scaled, conjugate-symmetric input) version of the FFT
can be derived by network transposition of the real-input case (since
network transposition interchanges inputs and outputs). Equivalently,
whereas Algorithm 5 was a scaled-output decimation-in-time
algorithm specialized for real inputs, the transposed algorithm
is a scaled-input decimation-in-frequency algorithm specialized for
real outputs. We do not list this transposed algorithm explicitly
here, however; our main point is simply to establish that an algorithm
for the DCT-III with exactly the same flop count (1)
follows immediately from the new DCT-II.

There are also other ways to derive a DCT-III algorithm with the same
flop count without using network transposition, of course. For example,
one can again consider the DCT-III to be a real-input DFT of size $4N$
with certain symmetries (different from the DCT-II’s symmetries),
and prune redundant operations from Algorithm 5,
resulting in a decimation-in-time algorithm. Such a derivation is
useful because it allows one to derive a scaled-*output* DCT-III
as well, which turns out to be a subroutine of a new DCT-IV algorithm
that we describe in another manuscript, currently in press [53].
However, this derivation is more complicated than the one above, resulting
in four mutually recursive DCT-III subroutines, and does not lower
the flop count, so we omit it here.

### VI-B DST-II and DST-III

The DST-II is defined, using a unitary normalization, as:
